A MACHINE LEARNING APPROACH FOR GENE EXPRESSION ANALYSIS AND APPLICATIONS

by

THANH NGOC LE
B.S., University of Economics in Hochiminh City, Vietnam, 1994
M.S., National University in Hochiminh City, Vietnam, 2000

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Computer Science and Information Systems, 2013.

This thesis for the Doctor of Philosophy degree by Thanh Ngoc Le has been approved for the Computer Science and Information Systems Program by Gita Alaghband, Chair; Tom Altman, Advisor; Katheleen Gardiner, Co-Advisor; James Gerlach; Sonia Leach; Boris Stilman. Date: April 10, 2013

Le, Thanh Ngoc (Ph.D., Computer Science and Information Systems)
A Machine Learning Approach for Gene Expression Analysis and Applications
Thesis directed by Professor Tom Altman

ABSTRACT

High-throughput microarray technology is an important and revolutionary technique used in genomics and systems biology to analyze the expression of thousands of genes simultaneously. The popular use of this technique has resulted in enormous repositories of microarray data, for example, the Gene Expression Omnibus (GEO), maintained by the National Center for Biotechnology Information (NCBI). However, an effective approach to optimally exploit these datasets in support of specific biological studies is still lacking. Specifically, an improved method is required to integrate data from multiple sources and to select only those datasets that meet an investigator's interest. In addition, to exploit the full power of microarray data, an effective method is required to determine the relationships among genes in the selected datasets and to interpret the biological meanings behind these relationships.
To address these requirements, we have developed a machine learning based approach that includes:
• An effective meta-analysis method to integrate microarray data from multiple sources; the method exploits information regarding the biological context of interest provided by the biologists.
• A novel and effective cluster analysis method to identify hidden patterns in selected data representing relationships between genes under the biological conditions of interest.
• A novel motif-finding method that discovers not only the common transcription factor binding sites of co-regulated genes, but also the miRNA binding sites associated with the biological conditions.
• A machine learning-based framework for microarray data analysis, with a web application to run common analysis tasks online.

The form and content of this abstract are approved. I recommend its publication.
Approved: Tom Altman

DEDICATION

I dedicate this work to my family, my wife, Lan, and my beloved daughter, Anh, for their constant support and unconditional love. I love you all dearly.

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Tom Altman, for his guidance and support throughout my graduate studies. His way of guidance helped me to develop the self-confidence and skills necessary to address not only research problems, but also real-world problems. Thank you very much, Dr. Altman. I would like to express my sincere gratitude to my co-advisor, Dr. Katheleen Gardiner, for her continuous support of my research work throughout the years. She guided me with remarkable patience and understanding. She was always there to help me whenever I encountered an obstacle or required advice. Her inputs were very useful. It was really a pleasure working with her. I would like to thank Dr. Sonia Leach for her help in solving difficult problems and for her valuable inputs as a member of my dissertation committee. She was very generous in transferring her vast knowledge to us. Also, I would like to thank Dr.
Gita Alaghband, Dr. James Gerlach, and Dr. Boris Stilman for serving on my dissertation committee and for their helpful suggestions and comments for the improvement of this project; your discussion, ideas, and feedback have been absolutely invaluable. I would like to thank you all for giving me an example of excellence as a researcher, mentor, instructor, and role model. I would like to thank my fellow graduate students, my lab mates, and everyone who helped me in many ways to make my graduate student life a memorable one. I am very grateful to all of you. I would especially like to thank my amazing family for the love, support, and constant encouragement I have gotten over the years. In particular, I would like to thank my mother, my brothers, and sisters. You are the salt of the earth, and I undoubtedly could not have done this without you.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter
1. Introduction
1.1 Gene expression data analysis challenges
1.2 Motivations and proposed methods
1.2.1 Microarray meta-analysis
1.2.2 Cluster analysis of gene expression data
1.2.3 DNA sequence analysis
1.3 Research contributions
2. Microarray analysis background
2.1 Microarray
2.1.1 Microarray platforms
2.1.2 Gene expression microarray data analysis
2.2 Microarray data meta-analysis
2.2.1 Microarray data integration
2.2.2 Issues with microarray data meta-analysis
2.3 Cluster analysis
2.3.1 Clustering algorithms
2.3.2 Hierarchical clustering algorithms
2.3.3 K-Means clustering algorithm
2.3.4 The mixture model with expectation-maximization algorithm
2.3.5 Fuzzy clustering algorithm
2.4 Gene regulatory sequence prediction
2.5 Datasets
2.5.1 Artificial datasets for cluster analysis
2.5.2 Artificial datasets for sequence analysis
2.5.3 Real clustering datasets
2.5.4 Gene expression datasets
2.5.5 CMAP datasets
2.5.6 Real biological sequence datasets
3.
Fuzzy cluster analysis using FCM
3.1 FCM algorithm
3.2 FCM convergence
3.3 Distance measures
3.4 Partially missing data analysis
3.5 FCM clustering solution validation
3.5.1 Global validity measures
3.5.2 Local validity measures
3.6 FCM partition matrix initialization
3.7 Determination of the number of clusters
3.7.1 Between-cluster-partition comparison
3.7.2 Within-cluster-partition comparison
3.8 Determination of the fuzzifier factor
3.9 Defuzzification of fuzzy partition
4. Methods
4.1 Microarray meta-analysis
4.1.1 Selection of samples and features
4.1.2 Sample and feature selection method
4.1.3 Methods for feature metric
4.2 Fuzzy cluster analysis methods using Fuzzy C-Means (FCM)
4.2.1 Modification of the objective function
4.2.2 Partition matrix initialization method
4.2.3 Fuzzy clustering evaluation method
4.2.4 Fuzzy clustering evaluation using Gene Ontology [180]
4.2.5 Imputation methods for partially missing data
4.2.6 Probability based imputation method (fzPBI) [154]
4.2.7 Density based imputation method (fzDBI) [155]
4.2.8 Probability based defuzzification (fzPBD) [156]
4.2.9 Fuzzy genetic subtractive clustering method (fzGASCE) [157]
4.3 Motif finding problem
4.3.1 HIGEDA algorithm
4.3.2 New motif discovery using HIGEDA
5. Applications
5.1 Recovery of a gene expression signature
5.2 Gene-drug association prediction
5.2.1 Model design
5.2.2 Model testing
5.3 Drug target prediction
5.4 Application of HIGEDA to prediction of regulatory motifs
6.
Conclusions and future work
References

LIST OF TABLES

Table
3.1: Performance of different distance measures on the Iris dataset
3.2: Performance of different distance measures on the Wine dataset
3.3: Average results of 50 trials using incomplete Iris data [140]
3.4: Performance of three standard algorithms on ASET1
3.5: Algorithm performance on ASET4 using MISC measure
3.6: Algorithm performance on ASET5 using MISC measure
4.1: Predicted HDAC antagonists
4.2: Predicted HDAC agonists
4.3: Predicted estrogen receptor antagonists
4.4: Predicted estrogen receptor agonists
4.5: Algorithm performance on the ASET1 dataset
4.6: fzSC correctness in determining cluster number on artificial datasets
4.7: fzSC performance on real datasets
4.8: Fraction of correct cluster predictions on artificial datasets
4.9: Validation method performance on the Iris dataset (3 true clusters)
4.10: Validation method performance on the Wine dataset (3 true clusters)
4.11: Validation method performance on the Glass dataset (6 true clusters)
4.12: Validation method performance on the Yeast dataset (5 true clusters)
4.13: Validation method performance on the Yeast-MIPS dataset (4 true clusters)
4.14: Validation method performance on the RCNS dataset (6 true clusters)
4.15: Degrees of belief of GO annotation evidences
4.16: Gene Ontology evidence codes
4.17: Validation method performance on the Yeast dataset using GO-BP
4.18: Validation method performance on the Yeast-MIPS dataset using GO-BP
4.19: Validation method performance on the RCNS dataset using GO-CC
4.20: Validation method performance on the RCNS dataset using GO-BP
4.21: Average results of 50 trials using an incomplete IRIS dataset with different percentages (%) of missing values
4.22: ASET4: Compactness measure
4.23: IRIS: Compactness measure
4.24: WINE: Compactness measure
4.25: Performance of GA algorithms on ASET2
4.26: Performance of GA algorithms on ASET3
4.27: Performance of GA algorithms on ASET4
4.28: Performance of GA algorithms on ASET5
4.29: Performance of GA algorithms on IRIS
4.30: Performance of GA algorithms on WINE
4.31: Average performance (LPC/SPC) on simulated DNA datasets
4.32: Average performance (LPC/SPC/run time (seconds)) on eight DNA datasets (# of sequences/length of motif/# of motif occurrences)
4.33: Detection of protein motifs (1-8, PFAM; 9-12, Prosite)
4.34: Motifs of ZZ, Myb and SWIRM domains by the four algorithms
4.35: Motifs of Presenilin-1 and Signal peptide peptidase by HIGEDA
5.1: Clustering results with estrogen significantly associated drugs
5.2: Estrogen signature and the 'signature' predicted by FZGASCE
5.3: Hsa21 set-1 query genes
5.4: Expanded set of the Hsa21 set-1
5.5: Gene clusters of the Hsa21 gene set-1 expanded set
5.6: Proposed gene expression signature for the Hsa21 gene set-1
5.7: Drugs predicted to enhance expression of Hsa21 gene set-1
5.8: Drugs predicted to repress expression of Hsa21 gene set-1
5.9: Hsa21 set-2 query genes
5.10: Predicted gene expression signature for the Hsa21 gene set-2
5.11: Drugs predicted to enhance expression of Hsa21 gene set-2
5.12: Drugs predicted to repress expression of Hsa21 gene set-2

LIST OF FIGURES

Figure
1-1: Microarray data analysis and the contributions of this research
2-1: Microarray design and screening
2-2: Microarray analysis
3-1:
Expression levels of three genes in five experimental conditions
3-2: ASET2 dataset with five well-separated clusters
3-3: PC index (maximize)
3-4: PE index (minimize)
3-5: FS index (minimize)
3-6: XB index (minimize)
3-7: CWB index (minimize)
3-8: PBMF index (maximize)
3-9: BR index (minimize)
3-10: Performance of the global validity measures on artificial datasets with different numbers of clusters
3-11: Dendrogram of the Iris dataset from a 12-partition generated by FCM
3-12: Dendrogram of the Wine dataset from a 13-partition generated by FCM
3-13: ASET1 dataset with 6 clusters
3-14: PC index on the ASET2 dataset (maximize)
3-15 (Wu [144]): Impact of m on the misclassification rate in the Iris dataset
3-16: Artificial dataset 5 (ASET5) with three clusters of different sizes
4-1: Venn diagram for predicted HDAC antagonists
4-2: Venn diagram for HDAC inhibitors
4-3: Venn diagram for predicted estrogen receptor antagonists
4-4: Venn diagram for predicted estrogen receptor agonists
4-5: Candidate cluster centers in the ASET1 dataset found using fzSC. Squares, cluster centers by FCM; dark circles, cluster centers found by fzSC.
Classes are labeled 1 to 6
4-6: Average RMSE of 50 trials using an incomplete ASET2 dataset with different percentages of missing values
4-7: Average RMSE of 50 trials using an incomplete ASET5 dataset with different percentages of missing values
4-8: Average RMSE of 50 trials using an incomplete Iris dataset with different percentages of missing values
4-9: Average RMSE of 50 trials using an incomplete Wine dataset with different percentages of missing values
4-10: Average RMSE of 50 trials using an incomplete RCNS dataset with different percentages of missing values
4-11: Average RMSE of 50 trials using an incomplete Yeast dataset with different percentages of missing values
4-12: Average RMSE of 50 trials using an incomplete Yeast-MIPS dataset with different percentages of missing values
4-13: Average RMSE of 50 trials using an incomplete Serum dataset with different percentages of missing values
4-14: (fzDBI) Average RMSE of 50 trials using an incomplete ASET2 dataset with different missing value percentages
4-15: (fzDBI) Average RMSE of 50 trials using an incomplete ASET5 dataset with different missing value percentages
4-16: (fzDBI) Average RMSE of 50 trials using an incomplete Iris dataset with different percentages of missing values
4-17: (fzDBI) Average RMSE of 50 trials using an incomplete Yeast-MIPS dataset with different percentages of missing values
4-18: Algorithm performance on ASET2
4-19: Algorithm performance on ASET3
4-20: Algorithm performance on ASET4
4-21: Algorithm performance on ASET5
4-22: Algorithm performance on IRIS
4-23: Algorithm performance on the WINE dataset
4-24: Algorithm performance on the GLASS dataset
4-25: A motif model Θ
4-26: Dynamic alignment of s='ACG' w.r.t. the motif model from Figure 4-25
4-27: v(t) = V_M × T / (T + t²).
V_M is the maximum value of v
4-28: Strep-H-triad motif by HIGEDA
5-1: Gene expression signature prediction
5-2: Gene-Drug association prediction
5-3: Expansion of gene set A
5-4: Identification of key genes
5-5: Gene expression signature generation and application
5-6: Transcription factor and microRNA binding site prediction using HIGEDA

1. Introduction

High-throughput microarray technology for determining global changes in gene expression is an important and revolutionary experimental paradigm that facilitates advances in functional genomics and systems biology. This technology allows measurement of the expression levels of thousands of genes with a single microarray. Widespread use of the technology is evident in the rapid growth of microarray datasets stored in public repositories. For example, since its inception in the early 1990s, the Gene Expression Omnibus (GEO), maintained by the National Center for Biotechnology Information (NCBI), has received thousands of data submissions representing more than 3 billion individual molecular abundance measurements [6,7].

1.1 Gene expression data analysis challenges

The growth in microarray data deposition is reminiscent of the early days of GenBank, when exponential increases in publicly accessible biological sequence data drove the development of the analytical techniques required for data analysis. However, unlike biological sequences, microarray datasets are not easily shared by the research community, resulting in many investigators being unable to exploit the full potential of these data. In spite of over 15 years of experience with microarray experiments, new paradigms for integrating and mining publicly available microarray results are still needed to promote widespread, investigator-driven research on shared data.
In general, integrating multiple datasets is expected to yield more reliable and more valid results, due to a larger number of samples and a reduction of study-specific biases. Ideally, gene expression data obtained in any laboratory, at any time, using any microarray technology should be comparable. However, this is not true in reality. Several issues arise when attempting to integrate microarray data generated using different array technologies. Meta-analysis and integrative data analysis exploit only the behavior of genes individually, at the genotype level. To discover gene behaviors at the phenotype level by connecting genotype to phenotype, additional analyses are required. Cluster analysis, an unsupervised learning method, is a popular method that searches for patterns of gene expression associated with an experimental condition. Each experimental condition usually corresponds to a specific biological event, such as a drug treatment or a disease state, thereby allowing discovery of drug targets, and of drug and disease relationships. Numerous algorithms have been developed to address the problem of data clustering. These algorithms, notwithstanding their different approaches, must deal with the issue of determining the number and the construction of clusters. Although some cluster indices address this problem, they still have the drawback of model over-fitting. Alternative approaches, based on statistics with the log-likelihood estimator and a model-parameter penalty mechanism, can eliminate over-fitting, but are still limited by assumptions regarding models of data distribution and by slow convergence of the model parameter estimation. Even when the number of clusters is known a priori, different clustering algorithms may provide different solutions, because of their dependence on the initialization parameters.
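This initialization dependence is easy to demonstrate with a minimal one-dimensional k-means sketch (an illustration only, not one of the algorithms developed in this dissertation; the data values and seed choices are invented): two different choices of initial centers settle on different final partitions of the same data.

```python
def kmeans(points, k, seeds, iters=20):
    """Plain 1-D k-means; the final centers depend on the initial seeds."""
    centers = [points[i] for i in seeds]
    for _ in range(iters):
        # assign each point to its nearest center (squared distance)
        labels = [min(range(k), key=lambda j: (p - centers[j]) ** 2) for p in points]
        # recompute each center as the mean of its assigned points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return sorted(centers)

# two well-separated groups plus a point midway between them
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 2.5]
good = kmeans(data, 2, seeds=[0, 3])  # one seed in each group
bad = kmeans(data, 2, seeds=[0, 1])   # both seeds in the left group
```

With one seed per group the centers converge near 0.7 and 5.1; with both seeds in the left group the algorithm converges to a different, inferior partition, which is exactly why the iterative schemes discussed here cannot guarantee a global optimum.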
Because most algorithms use an iterative process to estimate the model parameters while searching for optimal solutions, a globally optimal solution is not guaranteed [122,132].

1.2 Motivations and proposed methods

This dissertation focuses on the construction of a machine learning and data mining framework for discovery of biological meaning in gene expression data collected from multiple sources. Three tools have been developed: (i) a novel meta-analysis method for identifying expression profiles of interest; (ii) a set of novel clustering methods to incorporate prior biological knowledge and to predict co-regulated genes; and (iii) a novel motif-finding method to extract information on shared regulatory sequences. Together, these methods provide comprehensive analysis of gene expression data.

1.2.1 Microarray meta-analysis

Several methods have been proposed to combine inter-study microarray data. These methods fall into two major categories, based on the level at which data integration is performed: meta-analysis and direct integration of expression values. Meta-analysis integrates data from different studies after they have been pre-analyzed individually. In contrast, the latter approach transforms expression values into numerically comparable measures and/or normalizes them, integrates the microarray data from different studies at the expression level, and carries out the analysis on the combined dataset. Meta-analysis methods combine the results of analyses from individual studies in order to increase the ability to identify significantly expressed genes or altered samples across studies that used different microarray technologies (i.e., different platforms or different generations of the same platform). Lamb et al. [25] and Zhang et al. [26] proposed rank-based methods for performing meta-analysis.
They worked with the Connectivity Map (CMAP) datasets, a collection of one-off microarray screenings of more than 1,000 small-molecule treatments of five different human cell lines. Genes in each array were ranked using statistical methods, and the array data were then integrated based on how they fit a given gene expression signature, defined as a set of genes responding to the biological context of interest. Gene expression signatures could then be used to query array data generated using other array technologies. These methods successfully identified a number of small molecules with rational and informative associations with known biological processes. However, Lamb et al. [25] weighted up-regulated genes and down-regulated genes differently, while Zhang et al. [26] showed that genes with the same amount of expression change should be weighted equally. Both Zhang et al. [26] and Lamb et al. [25] used fold-change criteria (where "fold change", FC, is the ratio between the gene expression values of the treated and untreated conditions) to rank the genes in expression profiles, and they solved the problem of platform-specific probes by eliminating them. Because of the experimental design and the noise inherent in the screening process and imaging methods, a single gene may have probesets that differ in the direction of expression change, e.g., some probesets may be up-regulated and others down-regulated. Lamb et al. [25] and Zhang et al. [26] addressed this problem by including all probesets in the ranked list and averaging probesets across genes. In fact, it is not biologically necessary that a single differentially expressed gene show an identical FC for all probesets, and it is not straightforward to compare differentially expressed genes just by averaging their probeset values. Witten et al. [9] showed that FC-based approaches, namely FC-difference and FC-ratio, are linear-comparison methods and are not appropriate for noisy data or for data from different experiments.
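The fold-change ranking these methods rely on can be sketched in a few lines (gene names and expression values here are hypothetical; FC is the treated/untreated ratio defined above):

```python
def rank_by_fold_change(treated, control):
    """Rank genes by fold change FC = treated / control.
    The most up-regulated gene (largest FC) comes first."""
    fc = {g: treated[g] / control[g] for g in treated}
    return sorted(fc, key=fc.get, reverse=True)

# hypothetical expression values for three genes
control = {"geneA": 10.0, "geneB": 8.0, "geneC": 12.0}
treated = {"geneA": 30.0, "geneB": 4.0, "geneC": 12.0}
ranked = rank_by_fold_change(treated, control)
# geneA (FC = 3.0) ranks above geneC (FC = 1.0) and geneB (FC = 0.5)
```

Note that this ranking is purely a point estimate: it carries no replicate information, which is why, as discussed next, it cannot by itself assign statistical significance.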
In addition, these methods are unable to measure the statistical significance of the results. While approaches using the t-test were proposed, Deng [10] showed that the Rank Product (RP) method is not only more stable than the t-test, but also provides the statistical significance of the gene expression differences. Breitling et al. [29] proposed an RP-based approach that allows the analysis of gene expression profiles from different experiments and offers several advantages over the linear approaches, including the biologically intuitive FC criterion, while using fewer model assumptions. However, both the RP and t-test methods, as well as other popular meta-analysis methods, have problems with small datasets and datasets with a small number of replicates. Hong et al. [30] proposed an RP-based method that is appropriate not only for small sample sizes, but also improves performance on noisy data. Their method has achieved widespread acceptance and has been used in such diverse fields as RNAi analysis, proteomics, and machine learning [8]. However, most RP-based methods, unlike t-test based methods, do not provide an overall measure of differential expression for every gene within a given study [54]. We develop a meta-analysis method that uses RP first to determine the differentially expressed genes in each study, and then to construct ordered lists of up- and down-regulated genes in each study based on the p-values determined using a permutation-based method. We then use the non-parametric rank-based pattern-matching strategy based on the Kolmogorov-Smirnov statistic [25] to filter the profiles of interest. We again use RP to filter genes which are differentially expressed across multiple studies. In our method, a t-test based approach can also be used in the first step to generate an estimated FC for each gene within each study.
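The Rank Product statistic at the core of this pipeline has a compact form: for a gene g with rank r_gi in study i of k studies, RP_g = (r_g1 × … × r_gk)^(1/k), the geometric mean of its ranks. A minimal sketch (the study rankings below are invented for illustration; significance via permutation is omitted):

```python
from math import prod

def rank_product(rank_lists):
    """Geometric mean of each gene's ranks across k studies:
    RP_g = (r_g1 * ... * r_gk) ** (1/k).
    A small RP means the gene is consistently near the top of every list."""
    k = len(rank_lists)
    return {g: prod(r[g] for r in rank_lists) ** (1.0 / k)
            for g in rank_lists[0]}

# ranks of three genes in two hypothetical studies (1 = most up-regulated)
study1 = {"geneA": 1, "geneB": 3, "geneC": 2}
study2 = {"geneA": 1, "geneB": 3, "geneC": 2}
rp = rank_product([study1, study2])
best = min(rp, key=rp.get)  # the gene most consistently top-ranked
```

In the full method, RP values are compared against RPs computed on permuted rankings to obtain the p-values used to order the up- and down-regulated gene lists.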
These estimated FC values are then used in the cluster analysis, instead of the average FC values, to group the genes based on their expression change pattern similarity across multiple studies.

1.2.2 Cluster analysis of gene expression data

Numerous clustering algorithms are currently in use. Hierarchical clustering results in a tree structure, where genes on the same branch at the desired level are considered to be in the same cluster. While this structure provides a rich visualization of the gene clusters, it has the limitation that a gene can be assigned to only one branch, i.e., one cluster. This is not always biologically reasonable, because most, if not all, genes have multiple functions. In addition, because of the mechanism of one-way assignment of genes to branches, the results may not be globally optimal. The alternative approach, partitioning clustering, includes two major methods, heuristic-based and model-based. The former assigns objects to clusters using a heuristic mechanism, while the latter uses quantified uncertainty measures. Probability and possibility are the two uncertainty measures commonly used. While probabilistic bodies of evidence consist of singletons, possibilistic bodies of evidence are families of nested sets. Both probability and possibility measures are uniquely represented by distribution functions, but their normalization requirements are very different. The values of each probability distribution are required to sum to 1, while for possibility distributions, only the largest values are required to be 1. Moreover, the latter requirement may even be abandoned when possibility theory is formulated in terms of fuzzy sets [172-173, 179]. The mixture model with the expectation-maximization (EM) algorithm is a well-known method using the probability-based approach. This method has the advantages of a strong statistical basis and statistics-based model selection.
However, the EM algorithm converges slowly, particularly in regions where clusters overlap [27], and requires the data distribution to follow some specific distribution model. Because gene expression data are likely to contain overlapping clusters and do not always follow standard data distributions (e.g., Gaussian, Chi-squared) [28], the mixture model with the EM method is not appropriate. Fuzzy clustering using the most popular algorithm, Fuzzy C-Means (FCM) [92, 121], is another model-based clustering approach, one that uses the possibility measure. FCM both converges rapidly and allows assignment of objects to overlapping clusters using the fuzzifier factor, m, where 1 < m < ∞. When the value of m changes, the cluster overlaps change accordingly, and there are no overlapping regions in the clustering results when the value of m is equal to 1. However, similar to the EM-based method and the other partitioning methods, FCM has the problem of determining the correct number of clusters. Even if the cluster number is known a priori, FCM may provide different cluster partitions. Cluster validation methods are required to determine the optimal cluster solution. Unfortunately, the clustering model of FCM is a possibility-based one. There is no straightforward statistics-based approach to evaluate such a clustering model, except the cluster validity indices based on the compactness and separation factors of the fuzzy partitions. However, there are problems with these validity indices, both with scaling the two factors and with over-fit estimates. We have combined the advantages of the EM and FCM methods, where FCM plays the key role in clustering the data, and proposed a method to convert the clustering possibility model into a probability one and then to use the Central Limit Theorem to compute the clustering statistics and determine the data distribution model that best describes the dataset.
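The role of the fuzzifier m can be seen in the standard FCM membership update, sketched below for a single point: u_i = 1 / Σ_k (d_i / d_k)^(2/(m−1)), where d_i is the distance from the point to center i. This is only the textbook update step (the centers, point, and m value are illustrative), not the modified FCM developed in this dissertation.

```python
def fcm_memberships(point, centers, m=2.0):
    """Standard FCM membership of one 1-D point in each cluster:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))."""
    d = [abs(point - c) for c in centers]
    if 0.0 in d:  # the point coincides with a center: crisp membership
        return [1.0 if di == 0.0 else 0.0 for di in d]
    return [1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1)) for k in range(len(centers)))
            for i in range(len(centers))]

# a point at 1.0 between centers at 0.0 and 4.0, with fuzzifier m = 2
u = fcm_memberships(1.0, centers=[0.0, 4.0], m=2.0)
```

The memberships sum to 1, and the nearer center receives the larger share (here 0.9 versus 0.1); as m approaches 1 the larger share tends to 1, recovering a crisp, non-overlapping assignment.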
We applied the Bayesian method with the log-likelihood ratio and the Akaike Information Criterion measure, using the estimated distribution model, in a novel validation method for fuzzy clustering partitions [147].

1.2.3 DNA sequence analysis

Genes within a cluster are expressed similarly under the given experimental conditions, e.g., a drug treatment, a comparison between normal and diseased tissues, or other biological comparisons of interest. Genes that are similarly expressed may have regulatory DNA sequences in common, and appropriate analysis may identify motifs that are overrepresented in these gene sequences. Such motifs may contribute to expression of the entire group [31]. In addition, for clusters of down-regulated genes, we can apply the sequence analysis to the 3′ untranslated regions (3′ UTRs), potentially discovering motifs connected to miRNAs [32]. We may thereby discover relationships between the miRNAs and the diseases or treatment conditions targeted by the microarray experiments. Most algorithms developed to find motifs in biological sequences do not successfully identify motifs containing gaps and do not allow a variable number of motif instances in different sequences. In addition, they may converge to local optima [33] and therefore do not guarantee that the motifs are indeed overrepresented in the sequence set. The MEME algorithm [34] solved many of these problems, but still does not model gapped motifs well. By combining the EM algorithm with a hierarchical genetic algorithm, we developed a novel motif discovery algorithm, HIGEDA [145], that automatically determines the motif consensus using a Position Weight Matrix (PWM). By using a dynamic programming (DP) algorithm, HIGEDA also identifies gapped motifs.

1.3 Research contributions

The rapid growth of microarray repositories has increased the need for effective methods for integration of datasets among laboratories using different microarray technologies.
The meta-analysis approach has the advantage of integrating data without the problem of scaling expression values, and also has multiple supporting methods to detect and order significantly differentially expressed genes. However, parametric statistical methods may perform differently on different datasets, because of differing assumptions in the method model. We propose using the RP method [8] to filter the genes for differential expression. We also propose to integrate RP with a pattern-matching ranking method, in a method named RPPR [169], a combination of the Rank Product and Rank Page methods, to more effectively filter gene expression profiles from multiple studies. Meta-analysis of gene expression data does not put differentially expressed genes together at the phenotype level of the experiments. Therefore, the real power of microarrays is not completely exploited. Cluster analysis can discover relationships among genes, between genes and biological conditions, and between genes and biological processes. Most clustering algorithms are not appropriate for gene expression data analysis because they do not allow overlapping clusters and require the data to follow a specific distribution model. We developed a new clustering algorithm, fzGASCE [157], that combines the well-known optimization method, the Genetic Algorithm (GA), with FCM and fuzzy subtractive clustering to effectively cluster gene expression data without a priori specification of the number of clusters. Regarding the parameter initialization problem of the FCM algorithm, we developed a novel Fuzzy Subtractive Clustering (fzSC) algorithm [146] that uses a fuzzy partition of the data instead of the data themselves. fzSC has advantages over subtractive clustering in that it does not require a priori specification of the mountain peak and mountain radii, or a criterion for determining the number of clusters.
In addition, the computational time of fzSC is O(cn) instead of O(n²), where c and n are the numbers of clusters and data objects, respectively. To address the problem of missing data in cluster analysis, we developed two imputation methods, fzDBI and fzPBI, using histogram-based and probability-based approaches to model the data distributions and apply the models to the imputation of missing data. For the problem of cluster model selection and clustering result evaluation with the FCM algorithm, we developed a novel method, fzBLE [147], that uses the likelihood function, a statistical goodness-of-fit measure, and a Bayesian method with the Central Limit Theorem to effectively describe the data distribution and correctly select the optimal fuzzy partition. Using the statistical model of fzBLE, we developed a probability-based method for defuzzification of fuzzy partitions that helps generate classification information for data objects from a fuzzy partition. In addition to clustering of gene expression data, we also propose an analysis directed at the biological relevance of the cluster results. Using our RPPR method, we provide relationships between the gene clusters and the biological conditions of interest, and determine the clusters which respond positively and negatively to the experimental conditions. Furthermore, our motif-finding algorithm, HIGEDA [145], can discover common motifs in the promoter and mRNA sequences of genes in a cluster. In addition to predicting possible transcription factor binding sites, HIGEDA can also be used to predict potential miRNA binding sites. Figure 1-1 shows the key microarray data analysis issues which are addressed in this research.
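The PWM representation used by motif-finding algorithms such as HIGEDA scores a candidate site by summing per-position log-odds against a background model. The following toy sketch illustrates the scoring idea only (the 3-column PWM values and the uniform 0.25 background are assumptions for illustration, not parameters from HIGEDA):

```python
from math import log2

def pwm_score(pwm, site):
    """Log-odds score of a candidate site under a position weight matrix,
    assuming a uniform 0.25 background: sum_i log2(P(base_i at pos i) / 0.25)."""
    return sum(log2(pwm[i][base] / 0.25) for i, base in enumerate(site))

# toy 3-column PWM: column i gives P(base at motif position i)
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
strong = pwm_score(pwm, "ACG")  # matches the consensus at every position
weak = pwm_score(pwm, "TTT")    # mismatches every position
```

A positive score means the site is more likely under the motif model than under the background; EM-style motif finders iterate between scoring candidate sites this way and re-estimating the PWM columns from the best-scoring sites.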
The remainder of the dissertation is organized as follows: Chapter 2 presents the background of microarray analysis, including microarray technology, microarray data analysis and recent approaches, and the goals of this research; Chapter 3 describes a rigorous analysis of the Fuzzy C-Means algorithm; Chapter 4 describes our methods to address the challenges of gene expression microarray data analysis; and Chapter 5 demonstrates some specific applications of our approach, summarizes our contributions, and discusses future work.

[Figure 1-1: Microarray data analysis and the contributions of this research — microarrays from multiple labs yield gene expression profiles of interest; cluster analysis (fzGASCE) with determination of the optimal fuzzy partition (fzSC), missing data imputation (fzDBI, fzPBI), cluster validation (fzBLE), and defuzzification (fzPBD) produce gene groups; groups are selected (RPPR) and analyzed by motif discovery (HIGEDA), biological context relevance, and miRNA prediction.]

2. Microarray analysis background

2.1 Microarray

DNA microarrays rely on the specificity of hybridization between complementary nucleic acid sequences in DNA fragments (termed probes) immobilized on a solid surface and labeled RNA fragments isolated from biological samples of interest [6]. A typical DNA microarray consists of thousands of ordered sets of DNA fragments on a glass, filter, or silicon wafer. After hybridization, the signal intensity of each individual probe should correlate with the abundance of the labeled RNA complementary to that probe [1].

2.1.1 Microarray platforms

DNA microarrays fall into two types based on the DNA fragments used to build the array: complementary DNA (cDNA) arrays and oligonucleotide arrays. Although a number of subtypes exist for each array type, spotted cDNA arrays and Affymetrix oligonucleotide arrays are the major platforms currently in use.
The choice of which microarray platform to use is based on the research needs, cost, available expertise, and accessibility [1]. For cDNA arrays, cDNA probes, which are usually generated by polymerase chain reaction (PCR) amplification of cDNA clone inserts (representing genes of interest), are robotically spotted on glass slides or filters. The immobilized cDNA probe sequences may range greatly in length, but are usually much longer than the corresponding oligonucleotide probes. The major advantage of cDNA arrays is the flexibility in designing a custom array for specific purposes. Numerous genes can be rapidly screened, which allows very quick elaboration of functional hypotheses without any a priori assumptions [120]. In addition, cDNA arrays typically cost approximately one-fourth as much as commercial oligonucleotide arrays. Flexibility and lower cost initially made cDNA arrays popular in academic research laboratories. However, the major disadvantage of these arrays is the amount of total input RNA needed. It is also difficult to have complete control over the design of the probe sequences. cDNA is generated by the enzyme reverse transcriptase (an RNA-dependent DNA polymerase), which, like all DNA polymerases, cannot initiate synthesis de novo but requires a primer. It is therefore difficult to generate comprehensive coverage of all genes in a cell. Furthermore, managing large clone libraries, together with the relational database infrastructure needed for record keeping, sequence verification, and data extraction, is a challenge for most laboratories. For oligonucleotide arrays, probes are comprised of short segments of DNA complementary to the RNA transcripts of interest and are synthesized directly on the surface of a silicon wafer. When compared with cDNA arrays, oligonucleotide arrays generally provide greater gene coverage, more consistency, and better quality control of the immobilized sequences.
Other advantages include uniformity of probe length, the ability to discriminate gene splice variants, and the availability of carefully designed standard operating procedures. Another advantage particular to Affymetrix arrays is the ability to recover samples after hybridization to an array. This feature makes Affymetrix arrays attractive in situations where the amount of available tissue is limited. However, a major disadvantage is the high cost of arrays [1]. Following hybridization, the image is processed to obtain the hybridization signals. There are two different ways to measure signal intensity. In the two-color fluorescence hybridization scheme, the RNA from experimental and control samples (referred to as target RNAs) is differentially labeled with fluorescent dyes (Cy5, red, vs. Cy3, green) and hybridized to the same array. When the region of the probe is fluorescently illuminated, both the experimental and control target RNAs fluoresce, and the relative balance of red versus green fluorescence indicates the relative expression levels of experimental and control target RNAs. Therefore, gene expression values are reported as ratios between the two fluorescent values [1].

[Figure 2-1: Microarray design and screening — sample preparation and labelling, hybridization, washing, image acquisition, and image processing yielding feature-level data (.cel, .grp, ...).]

Affymetrix oligonucleotide arrays use a one-color fluorescence hybridization system where experimental RNA is labeled with a single fluorescent dye and hybridized to an oligonucleotide array. After hybridization, the fluorescence intensity from each spot on the array provides a measurement of the abundance of the corresponding target RNA. A second array is then hybridized to the control RNA, allowing calculation of expression differences.
Because Affymetrix array screening generally follows a standard protocol, results from different experiments in different laboratories can theoretically be combined [6]. Following image processing, the digitized gene expression data need to be pre-processed and normalized before carrying out further analysis.

2.1.2 Gene expression microarray data analysis

Regarding differentially expressed genes, many protocols use a cutoff of a twofold difference as a criterion. However, this arbitrary cutoff value may be either too high or too low depending on the data variability, which is not itself taken into account. A data point above or below the cutoff line could be there by chance or by error. To ensure that a gene is truly differentially expressed requires multiple replicate experiments and statistical testing [8]. However, not all microarray experiments are array-replicated. We, therefore, need to analyze data generated by different protocols. Statistical analysis, including meta-analysis, uses microarrays to study genes in isolation, while the real power of microarrays is their ability to study the relationships between genes and to identify genes or samples that behave in a similar manner. Machine learning and data mining approaches have been developed for further analysis procedures. These approaches can be divided into unsupervised and supervised methods. Unsupervised methods involve the aggregation of samples, genes, or both into different clusters based on the distance between measured gene expression values. The goal of clustering is to group objects with similar properties, leading to clusters where the distance measure is small within clusters and large between clusters. Several clustering methods from classical pattern recognition, such as hierarchical clustering, K-Means clustering, fuzzy C-Means clustering, and self-organizing maps, have been applied to microarray data analysis.
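As a concrete illustration of one of these classical methods, the K-Means assign-and-update loop can be sketched in a few lines of Python; the toy expression matrix and the fixed initial centers below are hypothetical, chosen only so that the two groups are obvious:

```python
import numpy as np

def kmeans(X, centers, iters=20):
    """Plain K-Means: assign each expression profile to its nearest
    center, then move each center to the mean of its assigned profiles."""
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # Euclidean distance of every object to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# toy expression matrix: 4 genes x 2 conditions, two obvious groups
X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]])
labels, centers = kmeans(X, centers=[[0.0, 0.0], [5.0, 5.0]])
```

In a real analysis the initial centers would be chosen by a seeding strategy rather than fixed by hand; sensitivity to that choice is one motivation for the initialization work described later in this thesis.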
Using unsupervised methods, we can search the resulting clusters for candidate genes or treatments whose expression patterns could be associated with a given biological condition or gene expression signature. The advantage of this approach is that it is unbiased and allows for identification of significant structure in a complex dataset without any prior information about the objects. In contrast, supervised methods integrate knowledge of sample class information into the analysis with the goal of identifying expression patterns (i.e., gene expression signatures) which could be used to classify unknown samples according to their biological characteristics. A training dataset, consisting of gene expression values and sample class labels, is used to select a subset of expressed genes that have the most discriminative power between the classes. It is then used to build a predictive model, also called a classifier (e.g., k-nearest neighbors, neural network, support vector machines), which takes the gene expression values of the pre-selected set of genes of an unknown sample as input and outputs the predicted class label of the sample.

[Figure 2-2: Microarray analysis — from a biological question through experimental design, quantification, preprocessing, normalization, estimation, and testing, to clustering, classification, and regulatory prediction; statistical analysis (meta-analysis, DEG detection), data mining (clustering, classification), and sequence analysis span the application-specific and technology-specific levels.]

2.2 Microarray data meta-analysis

2.2.1 Microarray data integration

The integrated analysis of data from multiple studies generally promises to increase statistical power, generalizability, and reliability, while decreasing the cost of analysis, because it is performed using a larger number of samples and the effects of individual study-specific biases are reduced.
There are two common approaches to the problem of integrating microarray data: meta-analysis and direct integration of expression values. The direct integration procedure [11-13, 17] can be divided into the following steps. First, a list of genes common to multiple distinct microarray platforms is extracted by cross-referencing the annotation of each probe set represented on the microarrays. Cross-referencing of expression data is usually achieved using the UniGene or LocusLink/EntrezGene databases, or the best-match mapping provided by Affymetrix. Next, for each individual dataset, numerically comparable quantities are derived from the expression values of genes in the common list by application of specific data transformation and normalization methods. Finally, the newly derived quantities from the individual datasets are combined. Direct integration methods include the following. Ramaswamy et al. [14] re-scaled expression values of a common set of genes for each of five cancer microarray datasets generated by independent laboratories using different microarray platforms. Combining them to form a dataset with increased sample size allowed identification of a gene expression signature that distinguished primary from metastatic cancers. Bloom et al. [15] used a scaling approach based on measurements from one common control sample to integrate microarray data from different platforms. Shen et al. [16] proposed a Bayesian mixture model to transform each raw expression value into a probability of differential expression (POD) for each gene in each independent array. Integrating multiple studies on the common probability scale of POD, they developed a 90-gene meta-signature that predicted relapse-free survival in breast cancer patients with improved statistical power and reliability. In addition to common data transformation and normalization procedures, Jiang et al.
[17] proposed a distribution transformation method to transform multiple datasets into a similar distribution before data integration. Data processed by distribution transformation showed greatly improved consistency in gene expression patterns between multiple datasets. Warnat et al. [18] used two data integration methods, median rank scores and quartile discretization, to derive numerically comparable measures of gene expression from independent datasets. These transformed data were then integrated and used to build support vector machine classifiers for cancer classification. Their results showed that cancer classification based on microarray data can be greatly improved by integrating multiple datasets with a similar focus. The classifiers built from integrated data showed high predictive power and improved generalization performance. A major limitation of these direct integration methods is that filtering genes to generate a subset common to multiple distinct microarray platforms often excludes many thousands of genes, some of which may be significant. Moreover, data transformation and normalization methods are data-sensitive [20]: one method may be the best fit for some datasets, but not for others. It is difficult to come to a consensus regarding which method is best for data transformation and normalization on given datasets. Several studies have shown that expression measurements from cDNA and oligonucleotide arrays may show poor correlation and may not be directly comparable [21]. These differences may be due to variances in probe content, deposition technologies, labeling and hybridizing protocols, as well as data extraction procedures (e.g., background correction, normalization, and calculation of expression values).
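The median-rank-scores idea attributed to Warnat et al. [18] can be pictured with a small sketch: replace each expression value by its within-sample rank, then summarize each gene by its median rank, so that values from differently scaled platforms become comparable. This is an illustrative reading of the transformation, not the authors' exact procedure:

```python
import numpy as np

def median_rank_scores(X):
    """Rank genes within every sample (column), with rank 1 assigned to
    the lowest expression, then summarize each gene by its median rank."""
    n_genes, n_samples = X.shape
    ranks = np.empty_like(X, dtype=float)
    for j in range(n_samples):
        order = X[:, j].argsort()
        ranks[order, j] = np.arange(1, n_genes + 1)
    return np.median(ranks, axis=1)

# two samples measure the same 3 genes on very different scales,
# yet the genes' relative order (and hence the scores) agree
X = np.array([[2.0, 200.0],
              [9.0, 900.0],
              [5.0, 500.0]])
scores = median_rank_scores(X)
```

Because only ranks survive the transformation, platform-specific intensity scales drop out, which is precisely what makes the transformed values combinable across datasets.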
For example, cDNA microarray data are usually defined as ratios between experimental and control values and cannot be directly compared with oligonucleotide microarray data, which are defined as expression values of experimental samples. Cross-laboratory comparisons of microarray data have also demonstrated that there are sometimes larger differences between data obtained in different laboratories using the same microarray technology than between data obtained in the same laboratory using different microarray technologies [22]. Wang et al. [116] also showed that the agreement between two technologies within the same lab was greater than that between two labs using the same technology; the lab effect, especially when confounded with the RNA sample effect, usually plays a bigger role than the platform effect in data agreement. Commercial microarray vendors, such as Affymetrix, have produced several generations of arrays to keep up with advances in genomic sequence analysis. The number of known genes and the representative composition of gene sequences are frequently updated, and probe sets are modified or added to better detect target sequences and to represent newly discovered genes. A recent study has shown that expression measurements within one generation of Affymetrix arrays are highly reproducible, but that reproducibility across generations depends on the degree of similarity of the probe sets and the levels of expression measurements [17]. Therefore, even when using the same microarray technology, different generations of microarrays make direct integration difficult. Technical variability, which results from differences in sample composition and preparation, experimental protocols and parameters, RNA quality, and array quality, poses further challenges to the direct integration of microarray data from independent studies. Vert et al. [20], and Irizarry et al.
[22] described the lab effect in microarray data and concluded that direct integration of expression data is not appropriate. Cahan et al. [21] and Warnat et al. [24], using two methods to identify differentially expressed genes prior to carrying out a classification analysis, showed that gene expression levels themselves could not be directly compared between different platforms. Therefore, we propose to use meta-analysis, instead of direct integration, before further computational analyses. The meta-analysis method, in contrast to the direct integration method, combines results from individual analyses. It therefore avoids the problem of scaling the gene expression levels among datasets from different laboratories. A number of studies have shown that meta-analysis provides a robust list of differentially expressed genes [21-23]. With the increasing use of next-generation sequencing techniques, microarrays are no longer the only high-throughput technology for gene expression studies. The direct integration approach appears inappropriate for such mixed repositories, because expression values can only be scaled across experiments that use the same technology. Even within microarray technology, e.g., between Affymetrix oligonucleotide and cDNA microarrays, scaling expression values for direct integration seems impossible.

2.2.2 Issues with microarray data meta-analysis

Considering that meta-analysis involves the whole process from choosing microarray data to detecting differentially expressed genes, the key issues are as follows.

Issue 1: Identify suitable microarray datasets

The first step is to determine which datasets to use with regard to the goal of the analysis. This is done by determining the inclusion-exclusion criteria. The most important criterion is that the collected datasets should be in the standard gene expression data format, e.g., features by samples [23].
Issue 2: Annotate the individual datasets

Microarray probe design uses short, highly specific regions in genes of interest, because using full-length gene sequences can lead to non-specific binding or noise. Different design criteria lead to the creation of different probes for the same gene. Therefore, one needs to identify which probes represent a given gene within and across the datasets. The first option is to cluster probes based on sequence data [11, 17]. A sequence match method is especially appropriate for cross-platform data integration, as well as Affymetrix cross-generation data integration. However, probe sequences may not be available for all platforms, and the clustering of probe sequences can be computationally intensive for very large numbers of probes. Alternatively, one can map probe-level identifiers such as IMAGE CloneID, Affymetrix ID, or GenBank accession numbers to a gene-level identifier such as UniGene, RefSeq, or LocusLink/EntrezGene. UniGene, which is an experimental system for automatically partitioning sequences into non-redundant gene-oriented clusters, is a popular choice to unify the different datasets. For example, UniGene Build #211 (released March 12, 2008) reduces the nearly 7 million human cDNA sequences to 124,181 clusters. To translate probe-level identifiers to gene-level identifiers, one can use either the annotation packages in BioConductor, or the identifier mapping tables provided by NCBI and Affymetrix for LocusLink/Entrez ID to probe ID, or UniGene for probe ID to RefSeq.

Issue 3: Resolve the many-to-many relationships between probes and genes

The relationship between probes and genes is unfortunately not unequivocal: in some cases a probe may report more than one gene, and vice versa. Even using the same Affymetrix platform, the combination of different chip versions creates serious difficulties, because the probe identification labels (IDs) are not conserved from chip to chip.
Therefore, to combine microarray data across studies, a unique nomenclature must be adopted and all the different IDs of the chips must be translated into a common system. It is reasonable that many probe identifiers are mapped onto one gene identifier. This follows from the current UniGene clustering and genome annotation: multiple probes per gene provide internal replicates, allow for the poor performance of some probes without loss of gene expression detection, and reflect the duplicate spotted probes on many microarray chips. The real issue arises when one probe identifier can be mapped to many gene identifiers, which can cause problems for the subsequent meta-analysis. For example, a probe could map to gene X in half of the datasets, but to both genes X and Y in the remaining datasets. The meta-analysis will then treat such probes as two separate gene entities, failing to fully combine the information for gene X from all studies. If one simply discards such probes, valuable information may be lost to further analysis.

Issue 4: Choosing the meta-analysis technique

The decision regarding which meta-analysis technique to use depends on the specific application. In this context, we focus on a fundamental application of microarrays: the two-class comparison, e.g., a class of treatment samples versus a class of control samples, where the objective is to identify genes differentially expressed between the two conditions. Let $X_{N \times P}$ represent the expression matrix of the selected datasets from multiple studies, where N is the number of common genes in the P selected samples. Let K be the number of studies from which the samples were selected, and denote by $P_k$ the number of samples that belong to the k-th study. Let $P_k^c$ be the number of control samples and $P_k^t$ the number of treatment (experimental) samples from the k-th study, so that $P_k = P_k^c + P_k^t$ and $P = \sum_{k=1}^{K} P_k$.
For Affymetrix oligonucleotide array experiments, we have $P_k^c$ chips with gene expression measures from the control class and $P_k^t$ chips with gene expression measures from the treatment class. For two-channel array experiments, we assume that the comparisons of log-ratios are all indirect, i.e., the samples are hybridized against a reference sample R. The expression values from the k-th study are then collected into X: $X_{kj}^t = \log_2(T_j/R)$, $j = 1, \dots, P_k^t$, and $X_{kq}^c = \log_2(C_q/R)$, $q = 1, \dots, P_k^c$. The meta-analysis is performed on the results of analyses carried out on X. There are four common ways to integrate such information across studies.

Vote counting

This method simply counts the number of studies in which a gene was declared significant [48]. If the number of studies is small, this method can be visualized using Venn diagrams.

Combining ranks

Unlike vote counting, this technique accounts for the order of the genes declared significant [30, 52-54]. There are three different approaches to aggregate the rankings of, say, the top-100 lists (the 100 most significantly up-regulated or down-regulated genes) from different studies [23]. Two of the algorithms use Markov chains to convert the pairwise preferences between the gene lists into a stationary distribution; the third algorithm is based on an order-statistics model. The rank product method, proposed by Breitling et al. [29] and based on the third algorithm, is the most popular of these approaches. It is a non-parametric statistic originally used to detect differentially expressed genes in a single dataset. It is derived from the biological reasoning of the fold change (FC) criterion, and it detects the genes that are consistently found among the most strongly up-regulated (or down-regulated) genes in a number of replicate experiments.
However, this method offers a natural way to overcome the heterogeneity among multiple datasets and can therefore be extended to meta-analysis, generating a single significance measurement for each gene in the combined study [30]. Within the dataset of study k, the pairwise fold change is computed for every gene i: $pFC_{tq} = T_t/C_q$, $t = 1, \dots, P_k^t$, $q = 1, \dots, P_k^c$. There are $P_k^t \times P_k^c$ such ratio values for each gene i; let $R = P_k^t \times P_k^c$. These ratios are ranked within each comparison to generate a ranking matrix $rFC_{N \times R}$. The rank product (RP) is then computed as the geometric mean of the ranks of each gene i:

$RP_i = \left( \prod_{r=1}^{R} rFC_{ir} \right)^{1/R}$.   (2.1)

$RP_i/N$ can be interpreted as a p-value, because it describes the probability of observing gene i at rank $rFC_{ir}$ or better in the r-th comparison, $r = 1, \dots, R$. A permutation procedure is carried out independently to generate all other possible combinations of ranks relative to gene i in the dataset, $nRP_i$. This procedure is repeated B times, for some positive integer B, to form a reference distribution of $nRP_i$ across the R comparisons, which is then used to compute the (adjusted) p-value and false discovery rate (FDR) for each gene.

Combining p-values

In the 1920s, Fisher developed a meta-analysis method that combines the p-values from individual studies [117]. In the k-th study, for the i-th gene, the p-value $p_{ik}$ is generated by one-sided hypothesis testing. The logs of the p-values of the same gene are summed across the K studies using Fisher's sum-of-logs method, defined as follows [50, 51]:

$S_i = -2 \sum_{k=1}^{K} \ln(p_{ik})$.   (2.2)

The $S_i$ values are then assessed by comparison against a chi-square distribution with 2K degrees of freedom, $\chi^2_{2K}$. Rhodes et al. [36] proposed a statistical model for performing meta-analysis of four independent prostate cancer microarray datasets from cDNA arrays and Affymetrix arrays.
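The two combination statistics just described, the rank product of Eq. (2.1) and Fisher's sum of logs of Eq. (2.2), can be sketched directly; the toy rank matrix and p-values below are hypothetical, and the permutation step that turns RP values into p-values and FDRs is omitted for brevity:

```python
import math
import numpy as np

def rank_product(rank_matrix):
    """RP_i = (prod_r rFC_ir)^(1/R): the geometric mean of each gene's
    ranks across the R pairwise comparisons (computed in log space
    to avoid overflow when R is large)."""
    R = rank_matrix.shape[1]
    return np.exp(np.log(rank_matrix).sum(axis=1) / R)

def fisher_combine(pvals):
    """Fisher's method: S = -2 * sum(ln p_k), referred to a chi-square
    distribution with 2K degrees of freedom. For an even df = 2K the
    survival function has the closed form exp(-S/2) * sum_{j<K} (S/2)^j / j!."""
    K = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # this is S/2
    return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(K))

# ranks of 3 genes over R = 4 treatment-vs-control comparisons;
# gene 0 is consistently near the top (rank 1 = most up-regulated)
rFC = np.array([[1.0, 1.0, 2.0, 1.0],
                [3.0, 2.0, 3.0, 3.0],
                [2.0, 3.0, 1.0, 2.0]])
rp = rank_product(rFC)                        # gene 0 gets the smallest RP
p_combined = fisher_combine([0.04, 0.10, 0.03])  # smaller than any input p
```

The closed-form survival function is exact only for integer K; a general implementation would use a library chi-square distribution instead.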
Each gene in each study was treated as an independent hypothesis, and a significance level (denoted by one p-value and one q-value) was assigned to each gene in each study based on random permutations. The similarity of significance across studies was assessed with meta-analysis methods and combined with multiple-inference statistical tests for each possible combination of studies. A cohort of genes was identified as consistently and significantly deregulated in prostate cancer. Marot et al. [40] used a sequential meta-analysis method in which they computed p-values at each sequential step and combined the p-values from the different steps to determine the significant genes or biological processes throughout the entire analysis process.

Combining effect sizes

The first step is to calculate the effect size $d_{ik}$ and the variance $\delta_{ik}$ associated with the effect size for gene i in the k-th study, $i = 1, \dots, N$; $k = 1, \dots, K$:

$d_{ik} = \frac{\bar{X}_{ik}^t - \bar{X}_{ik}^c}{S_k}$,   (2.3)

where $S_k$ is the pooled standard deviation estimated across the N genes in the k-th dataset:

$S_k^2 = \sum_{i=1}^{N} (N_{ik} - 1) S_{ik}^2 \Big/ \left( \sum_{i=1}^{N} N_{ik} - N \right)$.   (2.4)

Based on the mean differences, an estimated pooled mean difference, $\mu_i$, and its variance, $\sigma_i^2$, are computed for each gene across the K studies. A z-score $z_i$ is then calculated using $\mu_i$ and $\sigma_i^2$:

$z_i = \mu_i / \sqrt{\sigma_i^2}$.   (2.5)

The effect size can be calculated using the correlation coefficient and the method of Cohen [135], i.e., the difference between the means of the two groups standardized by the pooled standard deviation [37, 48, 51]. Statistical significance can be determined by comparing these z-scores against N(0,1). However, if the number of studies, K, is small, there may be a problem with over-fitting; in this case, a permutation method is used instead. Hedges and Olkin [136] showed that this standardized difference overestimates the effect size for studies with small sample sizes.
They proposed a small correction factor to calculate an unbiased estimate of the effect size, known as Hedges' adjustment. The study-specific effect sizes for each gene are then combined across studies into a weighted average, with weights inversely proportional to the variances of the study-specific estimates. Choi et al. [37, 38] introduced a new meta-analysis method which combines the results from individual datasets in the form of effect sizes and can model the inter-study variation. The effect size was defined as the standardized mean difference between the test samples and the normal samples. The effect sizes from multiple microarray datasets were combined to obtain an estimate of the overall mean, and statistical significance was determined by the permutation test extended to multiple datasets. It was demonstrated that data integration using this method promoted the discovery of small but consistent expression changes and increased the sensitivity and reliability of the analysis. An extended effect size model was then proposed by Hu et al. [39] for meta-analysis of microarray data.

Issue 5: Sample/feature selection

Regardless of the method of data integration, finding features or samples that together represent the biological meaning of the integrated datasets remains a challenge. Lin [42] used a heuristic method to detect significant genes from integrated data: the underlying spaces were determined using a heuristic search, and a set of candidate gene lists was then identified, again using a heuristic search. Parkhomenko et al. [43] used principal component analysis (PCA) to determine the most important components of the data spaces; pairs of components are processed one by one using a correlation measure to look for the 'best' combination of components in the dataset. This analysis is also heuristic-based. Su et al.
[44] used a set of genes similar to the gene expression signature targeted to the biological context of interest to detect candidate genes, called VIP (Very Important Pool) genes. These genes control the search process to find the significant samples (in the integrated dataset, the samples are columns, as mentioned in Issue 1) and genes from the integrated data. Tan et al. [47] approached the problem using a GA and a resampling method; the feature and sample candidates were selected in a process controlled by the GA, with the fitness function computed from statistical test results on resampled sets. Traditional methods to identify similar samples on the basis of gene expression values are Principal Component Analysis and Hierarchical Cluster Analysis. These methods, however, tend to group samples of the same cell lines and batches too strongly, because of the similarity among cells grown at the same time [25]. In addition, these methods require all expression profiles to be generated on the same microarray platform, because determining distances between expression profiles from different platforms is not straightforward, which limits their applicability to the available microarray repositories. Lastly, these methods may not benefit from additional information regarding the biological context of interest. For such cases, Huttenhower et al. [49] proposed a meta-analysis method based on a statistical approach within a Bayesian framework. Considering the biological context of the analysis, they measured the distance between every pair of genes and constructed a Bayesian framework for each specific biological context based on the set of arrays. The Bayesian framework was then used to compute the probability of the relevance of each array to the context. This probabilistic measurement can be used to filter the samples.
Their method requires a set of genes targeting the biological context; in addition to a definition by the biologists, the biological context can be defined by a set of genes having specific GO (Gene Ontology) terms, participating in a specific biological process, or belonging to a gene expression signature. Lamb et al. [25] and Zhang et al. [26] used gene expression signatures in pattern-matching, ranking-based approaches to perform meta-analysis on CMap datasets. Genes in each expression profile were ranked using fold changes, and expression profiles were then integrated based on how well they fit a given gene expression signature. Their research successfully identified a number of interesting compounds that correlated with known biological processes.

2.3 Cluster analysis

After meta-analysis, we have a list of genes differentially expressed in a single study or across multiple studies. Using this gene list, we extract the matrix of expression levels of the genes significantly expressed under the experimental conditions. Cluster analysis then groups the genes such that genes within the same cluster have similar expression patterns across studies. This analysis allows the simultaneous comparison of all clusters independently of the platform used and the species studied; similar samples under specific biological conditions from different species can also be analyzed together by meta-analysis. This comparison allows identification of: i) robust signatures of a pathology or a treatment across several independent studies; ii) sets of genes that may be similarly modulated in different disease states or following drug treatments; and iii) common sets of co-expressed genes between human and animal models.
Because a typical microarray study generates expression data for thousands of genes from a relatively small number of samples, an analysis of the integrated data from several sources or laboratories may provide outcomes that are not necessarily related to the biological processes of interest. Cluster analysis will further expose the groups of genes that are the key agents of the experimental results. Given a set of n objects $X = \{x_1, x_2, \dots, x_n\}$, let $\theta = \{U, V\}$, where the cluster set $V = \{v_1, v_2, \dots, v_c\}$ and the partition matrix $U = \{u_{ki}\}$, $k = 1, \dots, c$, $i = 1, \dots, n$, be a partition of X such that $X = \bigcup_{k=1}^{c} v_k$ and $\sum_{k=1}^{c} u_{ki} = 1$, $i = 1, \dots, n$. Each subset $v_k$ of X is called a cluster and $u_{ki}$ is the membership degree of $x_i$ to $v_k$; $u_{ki} \in \{0,1\}$ for a crisp partition, otherwise $u_{ki} \in [0,1]$. The goal of cluster analysis is to assign objects to clusters such that objects in the same cluster are highly similar to each other, while objects from different clusters are as divergent as possible. These sub-goals define the compactness and separation factors that are used not only for modelling the clustering objectives, but also for evaluating the clustering result. These two factors can be mathematically formulated in many different ways, leading to numerous clustering models. A dataset containing the objects to be clustered is usually represented in one of two formats: the object data matrix and the object distance matrix. In an object data matrix, the rows usually represent the objects and the columns represent the attributes of the objects in the context where the objects occur. The roles of the rows and columns can be interchanged for another representation, but this one is preferred because the number of objects is usually very large in comparison with the number of attributes. Here objects are genes and attributes are the expression levels of the genes under the experimental conditions.
Assume we have n genes and p experimental conditions. The gene data matrix X then has n rows and p columns, where $x_{ij}$ is the expression level of gene i under condition j and $x_i$ is the expression vector (or object vector) of gene i across the p experiments. The distance matrix contains the pairwise distances (or dissimilarities) of objects. Specifically, the entry (i, j) in the distance matrix represents the distance between objects i and j, $1 \le i, j \le n$. The distance between genes i and j can be computed using the object vectors i and j from the object data matrix based on a distance measure. However, the object data matrix cannot be fully recovered from the distance matrix, especially when the value of p is unknown. 2.3.1 Clustering algorithms Clustering algorithms are classified into either hierarchical or partitioning approaches. While the hierarchical methods group objects into a dendrogram (tree structure) based on the distance among clusters, the partitioning methods group objects directly into clusters, and objects are assigned to clusters based on the criteria of the clustering models. Clustering methods using a partitioning approach are classified into either model-based or heuristic-based approaches. Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. The model obtained from the data then defines the clusters and assigns objects to them. Although K-Means is a heuristic-based approach, once a set of K centers that are good representatives for the data is found, we can consider it a model that generates the data with membership degrees in {0,1} plus added noise. In this context, K-Means can be viewed as a model-based approach. 2.3.2 Hierarchical clustering algorithms Hierarchical clustering (HC) builds a dendrogram (hierarchical structure) of objects using either agglomerative (bottom-up) or divisive (top-down) approaches.
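Given the object data matrix described above, the distance matrix can be derived directly; the reverse direction, recovering X from D, is not possible in general. A small NumPy sketch with a hypothetical 4-gene, 3-condition matrix:

```python
import numpy as np

# Hypothetical gene data matrix: n = 4 genes (rows) x p = 3 conditions (columns).
X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, 2.9],
              [5.0, 0.5, 1.0],
              [5.2, 0.4, 1.1]])

# n x n Euclidean distance matrix: entry (i, j) holds d(x_i, x_j).
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
```

Here genes 1-2 and genes 3-4 form two tight groups, so off-diagonal entries within a group are much smaller than entries across groups.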
In the former approach, the dendrogram initially consists of singleton sub-trees, one per object (singleton clusters). The clustering process merges similar sub-trees: at each step, the two most similar sub-trees are merged to form a new sub-tree, and the process stops when the desired number of sub-trees, say c, at the highest level is achieved. In the top-down approach, the dendrogram starts with a single sub-tree that contains all the objects. The process then splits sub-trees into new sub-trees: at each step, the sub-tree of objects with maximum dissimilarity is split into two sub-trees, and the process stops when the number of leaves, say c, or the compactness and separation criteria, is achieved. The dissimilarity between every pair of clusters is measured by either the single linkage, $L_S(v_k, v_l) = \min\{d(x_i, x_j);\ x_i \in v_k, x_j \in v_l\}$, (2.6) the complete linkage, $L_C(v_k, v_l) = \max\{d(x_i, x_j);\ x_i \in v_k, x_j \in v_l\}$, (2.7) or the average linkage, $L_A(v_k, v_l) = \frac{1}{|v_k|\,|v_l|}\sum_{x_i \in v_k}\sum_{x_j \in v_l} d(x_i, x_j)$. (2.8) The clustering process results in a set of c sub-trees at the level of interest in each approach. In comparison with the partitioning approach, each such sub-tree corresponds to a cluster. Because of their methods of merging or splitting sub-trees, HC algorithms are considered heuristic-based approaches. The common criticism of standard HC algorithms is that they lack robustness and, hence, are sensitive to noise and outliers. Once an object is assigned to a cluster, it will not be considered again, meaning that HC algorithms are not capable of correcting prior misclassifications, and each object can belong to just one cluster. The computational complexity of most HC algorithms is at least $O(n^2)$, and this high cost limits their application to large-scale datasets. Other disadvantages of HC include the tendency to form spherical shapes and the reversal phenomenon, in which the normal hierarchical structure is distorted.
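The agglomerative procedure with the linkage functions above is available in standard libraries; a brief sketch using SciPy on hypothetical data, where method='average' corresponds to the average linkage of eq. (2.8) ('single' and 'complete' give eqs. (2.6) and (2.7)):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated hypothetical expression groups (5 "genes" each, 3 conditions).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 3)),
               rng.normal(3.0, 0.1, (5, 3))])

# Agglomerative clustering with average linkage, eq. (2.8).
Z = linkage(X, method='average')
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into c = 2 clusters
```

Cutting the dendrogram at c = 2 recovers the two planted groups.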
2.3.3 K-Means clustering algorithm The K-Means algorithm is a partitioning approach using a heuristic-based method. Its objective function is based on the square error criterion, $J(U, V \mid X) = \sum_{k=1}^{c}\sum_{i=1}^{n} u_{ki}\,\|x_i - v_k\|^2$, where $u_{ki} \in \{0,1\}$. (2.9) K-Means is the best-known square error-based clustering algorithm. Its objective, J, is based on both the compactness and separation factors; minimizing J is equivalent to minimizing the compactness while maximizing the separation. K-Means algorithm steps: 1) Initialize the partition matrix randomly or based on some prior knowledge. 2) Compute the cluster set V. 3) Assign each object in the dataset to the nearest cluster. 4) Re-estimate the partition matrix using the current cluster set V. 5) Repeat Steps 2, 3, and 4 until no object changes its cluster. The K-Means algorithm is very simple and can easily be implemented to solve many practical problems. It works very well for compact and hyper-spherical clusters. The computational complexity of K-Means is O(cn); because c is much smaller than n, K-Means can be used to cluster large datasets. The drawbacks of K-Means are the lack of an efficient and universal method for identifying the initial partitions and the number of clusters c. A general strategy for this problem is to run the algorithm many times with randomly initialized partitions, although this does not guarantee convergence to a global optimum. K-Means is also sensitive to outliers and noise: even if an object is distant from the cluster center, it can still be forced into a cluster, thus distorting the cluster shapes. 2.3.4 The mixture model with expectation-maximization algorithm In the probability context, objects can be assumed to be generated according to several probability distributions, and objects in different clusters may be generated by different probability distributions.
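The K-Means steps listed above can be sketched in a few lines of NumPy, assuming Euclidean distance and a random initial partition (an illustrative sketch, not the implementation used in this thesis):

```python
import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    """Minimal K-Means: random initial partition, then alternate
    center estimation (Step 2) and nearest-center reassignment
    (Steps 3-4) until no object changes its cluster (Step 5)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, c, len(X))                 # Step 1: random partition
    for _ in range(n_iter):
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else X[rng.integers(len(X))]  # re-seed an empty cluster
                            for k in range(c)])           # Step 2
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)                    # Steps 3-4
        if np.array_equal(new_labels, labels):           # Step 5: stable partition
            break
        labels = new_labels
    return centers, labels
```

On well-separated, hyper-spherical clusters this converges in a handful of iterations; on overlapping clusters it inherits the sensitivity to initialization discussed above.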
They can be derived from different types of density functions (e.g., multivariate Gaussian or t-distribution), or from the same family but with different parameters. If the distributions are known, finding the clusters of a given dataset is equivalent to estimating the parameters of several underlying models. Denote by $P(v_k)$ the prior probability of cluster $v_k$, with $\sum_{k=1}^{c} P(v_k) = 1$; the conditional probability of object $x_i$ given the clustering partition $\theta = \{U, V\}$ is $P(x_i \mid \theta) = \sum_{k=1}^{c} P(x_i \mid u_k, v_k)\,P(v_k)$. (2.10) Given an instance of $\theta$, the posterior probability for assigning a data point to a cluster can easily be calculated using Bayes' theorem. The mixtures therefore can be applied using any type of components; the multivariate Gaussian distribution is commonly used due to its complete theory and analytical tractability. The likelihood function $P(X \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$ is the probability of generating X using the model. The best model $\theta^*$, therefore, should maximize the log likelihood function, $L(X \mid \theta) = \log P(X \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)$, $L(X \mid \theta^*) = \max_{\theta}\{L(X \mid \theta)\}$. (2.11) $\theta^*$ can be estimated using the expectation-maximization (EM) algorithm. EM considers X a combination of two parts: $X^O$, the observations of X, and $X^M$, the missing information of X regarding $\theta$; $X^M$ is similar to U in crisp clustering. The complete data log likelihood is then defined as $L(X \mid \theta) = L(X^O, X^M \mid \theta) = \sum_{i=1}^{n}\sum_{k=1}^{c} x^M_{ki} \log\left[P(v_k)\,P(x^O_i \mid \theta_k)\right]$. (2.12) Starting from an initial value $\theta^0$, EM searches for the optimal value of $\theta$ by generating a series of estimates $\{\theta^0, \theta^1, \dots, \theta^T\}$, where T denotes the step at which the convergence criterion is reached. EM algorithm steps: 1) Randomly generate a value $\theta^0$ of $\theta$; set t = 0. 2) E-step: compute the expectation of the complete data log likelihood, $Q(\theta, \theta^t) = E(L(X \mid \theta))$. 3) M-step: select a new parameter estimate that maximizes Q(.), $\theta^{t+1} = \arg\max_{\theta}\{Q(\theta, \theta^t)\}$. 4) Set t = t + 1. Repeat Steps 2 and 3 until the convergence criterion holds.
EM assumes that the data distribution follows a specific multivariate distribution model, for example, a Gaussian distribution. The major disadvantages of the EM algorithm are its sensitivity to the selection of the initial value of $\theta$, the effect of a singular covariance matrix, the possibility of converging to a local optimum, and the slow convergence rate. In addition, because most biological datasets do not follow a specific distribution model, EM may not be appropriate. 2.3.5 Fuzzy clustering algorithm The clustering techniques we have discussed so far are referred to as hard or crisp clustering, which means that each object is assigned to only one cluster. In fuzzy clustering, this restriction is relaxed, and the object $x_i$ can belong to cluster $v_k$, $k = 1, \dots, c$, with a certain degree of membership $u_{ki}$, $u_{ki} \in [0,1]$. This is particularly useful when the boundaries among the clusters are not well separated and are therefore ambiguous. Moreover, the memberships may help us discover more sophisticated relationships between a given object and the disclosed clusters. Fuzzy C-Means (FCM) is one of the most popular fuzzy clustering algorithms. FCM attempts to find a fuzzy partition for a set of data points while minimizing an objective function similar to (2.9). By using a fuzzifier factor, m, FCM is flexible in managing the overlap regions among clusters: $J(U, V \mid X) = \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ki}^{m}\,\|x_i - v_k\|^2$. (2.13) FCM algorithm steps: 1) Randomly initialize the values of the partition matrix $U^0$; set t = 0. 2) Estimate the value of $V^t$ using $v_k = \sum_{i=1}^{n} u_{ki}^{m} x_i \big/ \sum_{i=1}^{n} u_{ki}^{m}$. (2.14) 3) Compute the new values $U^{t+1}$ that maximally fit $V^t$, $u_{ki} = \left[\sum_{l=1}^{c}\left(\frac{\|x_i - v_k\|}{\|x_i - v_l\|}\right)^{\frac{2}{m-1}}\right]^{-1}$. (2.15) 4) Repeat Steps 2 and 3 until either $\{v_k\}$ or $\{u_{ki}\}$ converges. As with the EM and K-Means algorithms, FCM performs poorly in the presence of noise and outliers, and has difficulty identifying the initial parameters.
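The FCM iteration above, alternating the prototype update (2.14) and the membership update (2.15), can be sketched as follows; the fuzzifier m = 2 and the stopping threshold are illustrative choices, not prescribed by the text:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, seed=0, eps=1e-9):
    """Minimal Fuzzy C-Means: columns of U hold membership degrees of
    each object in the c clusters and always sum to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                        # random fuzzy partition U^0
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)          # prototype update (2.14)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + eps
        U_new = d2 ** (-1.0 / (m - 1))
        U_new /= U_new.sum(axis=0)                            # membership update (2.15)
        if np.abs(U_new - U).max() < 1e-6:                    # convergence check
            break
        U = U_new
    return U, V
```

A hard assignment, when needed, is obtained by taking the cluster of maximum membership for each object (U.argmax(axis=0)).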
To address this issue, FCM has been integrated with heuristic-based search algorithms, such as GA, Ant Colony (AC) optimization, and Particle Swarm Optimization (PSO). The mountain method or subtractive clustering can also be used as an alternative to search for the best initial value of V for the FCM algorithm. Due to the limitations of possibility theory, it is difficult to recover data from a fuzzy partition model. However, recent research has shown that it is possible to construct a probability model from the possibility one. Therefore, by using a possibility-to-probability transformation, it is possible to recover the data from the fuzzy partition model, and the FCM algorithm can be considered a model-based algorithm. The FCM algorithm has advantages over its crisp/probabilistic counterparts, especially when there is a significant overlap between clusters [57-59]. FCM can converge rapidly and provides more information about the relationships between genes and groups of genes in the cluster results. We therefore choose the FCM algorithm for the cluster analysis of gene expression data. However, the FCM algorithm also has some drawbacks. The objective function can help to estimate the model parameters, but cannot distinguish the "best" solution from the numerous possible local optima. The lack of an effective cluster validation method for FCM prevents its application to real-world datasets, where the number of clusters is not known. FCM is well known for its rapid convergence, but it is not guaranteed to reach the global optimum. Because of these issues, a further analysis of FCM is carried out in Chapter 3 to help us develop a novel method. 2.4 Gene regulatory sequence prediction The cluster analysis process produces groups of genes with similar expression patterns. Genes in each group may also have similar regulatory sequences. Understanding global transcriptional regulatory mechanisms is one of the fundamental goals of the post-genomic era [69].
Conventional computational methods using microarray data to investigate transcriptional regulation focus mainly on the identification of transcription factor binding sites. However, many molecular processes contribute to changes in gene expression, including transcription rate, alternative splicing, nonsense-mediated decay, and mRNA degradation (controlled, for example, by miRNAs or RNA-binding proteins). Thus, computational approaches are needed to integrate such molecular processes. miRNAs bind to complementary sites within the 3′-UTRs of target mRNAs to induce cleavage and repression of translation [175]. In the past decade, several hundred miRNAs have been identified in mammalian cells. Accumulating evidence thus far indicates that miRNAs play critical roles in multiple biological processes, including cell cycle control, cell growth and differentiation, apoptosis, and embryo development. At the biochemical level, miRNAs regulate mRNA degradation in a combinatorial manner, i.e., individual miRNAs regulate the degradation of multiple genes, and the regulation of a single gene may be conducted by multiple miRNAs. This combinatorial regulation is thought to be similar in scope to transcription factor partner regulation. Therefore, in addition to transcription factors and other DNA/RNA-binding proteins, comprehensive investigations into the transcriptional mechanisms underlying alterations in global gene expression patterns should also consider the emerging role of the miRNAs. Motif finding algorithms are designed to identify transcription factor binding sites (TFBS) and miRNA complementary binding sites (MBS). These sequences are usually short, ~8 bp and ~23 bp, respectively. In addition, each gene may have zero, one, or more such binding sites (BS). These BSs or motifs may be over-represented differently in the DNA and RNA sequences of genes in the same group. Motifs are difficult to recognize because they are short, often highly degenerate, and may contain gaps.
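Short, degenerate motifs of this kind are commonly summarized statistically by a position-specific score matrix. A minimal sketch built from a few hypothetical aligned site instances (the sites and the pseudocount of 1 are assumptions for illustration):

```python
import numpy as np

# Hypothetical aligned binding-site instances used to build the matrix.
sites = ["TGACGT", "TGACGC", "TGATGT", "AGACGT"]
alphabet = "ACGT"

# Per-position base frequencies with a pseudocount of 1,
# converted to log-odds against a uniform (0.25) background.
counts = np.ones((len(sites[0]), 4))
for s in sites:
    for j, base in enumerate(s):
        counts[j, alphabet.index(base)] += 1
pssm = np.log2(counts / counts.sum(axis=1, keepdims=True) / 0.25)

def score(seq):
    """Log-odds score of a candidate site under the matrix."""
    return sum(pssm[j, alphabet.index(b)] for j, b in enumerate(seq))
```

Candidate sites resembling the aligned instances score well above background-like strings, which is why such matrices are easy to evaluate statistically.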
Although motif-finding is not a new problem, challenges remain because no single algorithm both describes motifs and finds them effectively. Popular methods for motif-finding use a position-specific score matrix (PSSM), which describes the motifs statistically, or consensus strings, which represent a motif by one or more patterns that appear repeatedly with a limited number of differences. Of the two, the PSSM is preferred because it is more informative and can easily be evaluated using statistical methods. In addition, a consensus motif model can be replaced by a PSSM one. One of the most popular motif-finding algorithms using a PSSM is MEME, proposed by Bailey and Elkan [70]. The advantage of MEME is that it uses expectation-maximization (EM) which, as a probability-based algorithm, produces statistically significant results if it can reach a global optimum. The disadvantage is that, for motif seeds, it uses existing subsequences from within the sequence set and, as a result, may fail to discover subtle motifs. Chang et al. [72] and Li et al. [76] overcame this drawback by using the genetic algorithm (GA) to generate a set of motif seeds randomly. However, because they used GA with random evolution processes, rapid convergence to a solution is not assured. Li et al. [77] improved on MEME by using one instance of a position weight matrix (PWM), a type of PSSM, to represent a motif, and statistical tests to evaluate the final model. However, because EM may converge to local optima, the use of a single PWM may fail to find a globally optimal solution. The GA-based methods of Wei and Jensen [83] and Bi [71] use chromosomes to encode motif positions. The method of the former [83] is appropriate for models with zero to one occurrence of the motif per sequence (ZOOPS); the latter [71] is appropriate for models with one occurrence per sequence (OOPS), because one variable can be used to represent the single motif occurrence in each sequence. However, Li et al.
[77] recently showed that ZOOPS and OOPS are inadequate when not every sequence has the same motif frequency, and that the two-component mixture (TCM) model, which assumes a sequence may have zero or multiple motif occurrences, should be used. However, TCM requires a set of variables for every sequence to manage motif positions and, hence, the size of a chromosome can approach the size of the dataset [145]. Lastly, the above algorithms are restricted to finding gapless motifs and, therefore, will fail to find many functionally important, gapped motifs. While some methods, e.g., the pattern-based methods of Pisanti et al. [81] and Frith [73], allow gapped motifs, they require the gapped patterns to be well-defined and they generate gap positions randomly or by using a heuristic method. Alternatively, Liu et al. [78] used neural networks to find gapped motifs, but their approach required a limited and specific definition of the neural network structure. 2.5 Datasets In order to evaluate the previous studies as well as our proposed methods for microarray analysis, we use multiple data types with different levels of complexity. Artificial datasets are used in both clustering and motif-finding and are preferred for testing methods because of the ease in generating and running benchmarks. For cluster analysis, we generated datasets using a finite mixture model of data distribution [27], because known cluster structures are ideal for evaluating the capability of an algorithm to determine the number of clusters and the cluster prototypes. For sequence analysis, we generated motifs in a set of randomly generated DNA sequences using the normal distribution model [71] with a known number of motif occurrences and locations in every sequence. Some popular datasets in the Machine Learning Repository, University of California, Irvine (UCI) [84], are also used to show how effective our methods are for cluster analysis using real datasets.
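The motif-implanting step for simulated DNA sequences can be sketched as follows; this is an illustrative generator under stated assumptions (the helper name and parameters are hypothetical, not the exact procedure of Bi [71]):

```python
import random

def make_sequence(length, motif, change, gc=0.5, seed=0):
    """Generate one background DNA sequence with the given GC content
    (uniform background when gc = 0.5) and implant a single copy of the
    motif, each base mutated with probability `change` to a different base."""
    rng = random.Random(seed)
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]   # A, C, G, T
    seq = rng.choices("ACGT", weights=weights, k=length)
    mutated = [rng.choice("ACGT".replace(b, "")) if rng.random() < change else b
               for b in motif]
    pos = rng.randrange(length - len(motif) + 1)
    seq[pos:pos + len(motif)] = mutated
    return "".join(seq), pos
```

With change = 0.09 roughly 91% of the implanted symbols match the original string, matching the limited-divergence setting described in Section 2.5.2.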
The most complex datasets, e.g., gene expression data, including the CMAP datasets, present the most challenging clustering problems. Lastly, for sequence analysis, we used eight DNA transcription factor binding site datasets; two eukaryotic datasets: ERE and E2F [71], and six bacterial datasets: CRP, ArcA, ArgR, PurR, TyrR and IHF [80]. 2.5.1 Artificial datasets for cluster analysis Artificial datasets were generated using a finite mixture model of data distribution [27] with different numbers of data objects, clusters, and dimensions. The clusters were generated with a slight overlap (overlapping ratio = 0.1). For test purposes, we selected five datasets, named ASET1 to ASET5, where the first four datasets are uniform and the last one is non-uniform. Datasets of different dimensions support the stability of our methods. We expect our methods to properly evaluate clustering results, successfully detect the correct number of clusters, properly locate the cluster centers, and correctly assign data points to their actual clusters. 2.5.2 Artificial datasets for sequence analysis We generated simulated DNA datasets as in Bi [71], with three different background base compositions: (a) uniform, where A, T, C, G occur with equal frequency, (b) AT-rich (AT = 60%), and (c) CG-rich (CG = 60%). The motif string, GTCACGCCGATATTG, was merged once or twice into each sequence, after a defined level of string change: (i) 9% change, representing limited divergence (i.e., 91% of symbols are identical to the original string), (ii) 21% change, or (iii) 30% change, which is essentially background or random sequence variation. 2.5.3 Real clustering datasets Some popular real datasets from the UCI Machine Learning Repository [84] were used to test our method. Iris data The Iris dataset contains information about the sepal length, the sepal width, the petal length, and the petal width of three classes of Iris flowers: setosa, versicolor, and virginica.
Each class has 50 data objects, for a total of 150 data objects in the dataset. This dataset does not contain missing values. The expected results should recover the three classes with their own data objects. Wine This dataset contains 178 data objects, each with 13 attributes corresponding to the measures of alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. The data objects are known to belong to three different classes. There are no missing values in the dataset. We expect to detect three clusters in the dataset and properly assign the data objects to their classes. Glass This dataset contains 214 data objects, each having ten attributes, of which nine are the glass characteristics of refractive index, sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron content. The last attribute, a value ranging from 1 to 6, is for glass identification. The dataset with the first nine attributes is used for clustering; the results should contain six clusters and assign the data objects to their proper class identification. Breast Cancer Wisconsin The Breast Cancer Wisconsin (BCW) dataset contains 699 data objects with nine attributes regarding the clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. The objects of the dataset are from two different classes of tumor state: benign and malignant. The number of classes in this dataset is unknown; recent research has concluded that 6 is a reasonable number of classes. We expect to discover 6 clusters in the dataset using our methods. 2.5.4 Gene expression datasets Yeast and Yeast-MIPS data The yeast cell cycle data showed expression levels of approximately 6000 genes across two cell cycles comprising 17 time points [85].
By visual inspection of the raw data, Cho et al. [85] identified 420 genes that show significant variation. From the subset of 420 genes, Yeung et al. [102] selected 384 genes that achieved peak values in only one phase, and obtained five standard categories by grouping into one category the genes that peaked during the same phase. Among the 384 selected genes, Tavazoie et al. [118], through a search of the protein sequence database MIPS [86], found 237 genes that can be grouped into four functional categories: DNA synthesis and replication, organization of centrosome, nitrogen and sulphur metabolism, and ribosomal proteins. The functional annotations show the cluster structure in the dataset. We named these two subsets Yeast and Yeast-MIPS, respectively. We expect clustering results to approximate the five-class partition in the former and a four-class partition in the latter. RCNS - Rat Central Nervous System data The RCNS dataset was obtained by reverse transcription-coupled PCR designed to study the expression levels of 112 genes over nine time points during rat central nervous system development [87]. Wen et al. [87] classified these genes into four functional categories based on prior biological knowledge. These four classes are the external criterion in this dataset. 2.5.5 CMAP datasets CMAP datasets contain expression profiles from multiple drug treatments performed at different concentrations. Originally there were 164 distinct small molecules used in 453 different treatment instances, resulting in 564 expression profiles. These datasets have been used to detect associations between genes and drugs. Lamb et al. [25] and Zhang et al. [26] were successful in recovering and exploiting some gene-drug associations using gene expression signatures with matching-and-ranking based methods. These datasets will be used in our approach to demonstrate that we can replicate the results of Lamb et al. [25] and Zhang et al. [26].
In addition, more gene expression signatures will be discovered and the associations between DNA motifs and drugs will be established using our methods. By discovering novel gene expression signatures and related pathways for the drugs, we expect to exploit the desired effects as well as the side-effects of drug treatments. Additionally, by using our motif-finding methods, we hope to predict some of the associations between drugs and DNA sequences and the connections between miRNAs and diseases. 2.5.6 Real biological sequence datasets A. Eukaryotic transcription factor binding site datasets E2F dataset E2F transcription factors play a key role in the regulation of cell cycle-associated genes. Identifying target genes for these factors is a major challenge. We used the set of 25 sequences from Kel et al. [35] that contain 27 instances of the E2F transcription factor binding site. The goal is to discover motif occurrences that locate exactly at, or overlap with, binding sites at known positions. ERE dataset The estrogen receptor (ER) is a ligand-activated enhancer protein that is a member of the steroid/nuclear receptor superfamily. Two genes encode mammalian ERs, ERalpha and ERbeta. In response to estradiol (E2), ER binds with high affinity to specific DNA sequences called estrogen response elements (EREs) and transactivates gene expression. We used the set of 25 sequences from Klinge [14], each of which contains an instance of the ERE. The purpose is to detect the motif sequences located exactly at, or overlapping with, the EREs. B. Bacterial transcription factor binding site datasets CRP dataset CRP is one of seven "global" transcription factors in E. coli known to regulate more than 100 transcription units. CRP's activity is triggered in response to glucose starvation and other stresses by the binding of the second messenger cAMP.
CRP binding sites have proved to be particularly noisy because computational searches for consensus binding sites have missed many known binding sites. CRP was chosen for its highly indiscriminate binding site [115]. ArcA dataset ArcA is a global regulator that changes in relation to the expression of fermentation genes and represses the aerobic pathways when E. coli enters low-oxygen growth conditions. ArcA was chosen for its different protein domain (CheY-like) and very low-consensus binding site [115]. ArgR dataset ArgR, complexed with L-arginine, represses the transcription of several genes involved in the biosynthesis and transport of arginine and histidine and in its own synthesis, and activates genes for arginine catabolism. ArgR is also essential for a site-specific recombination reaction that resolves plasmid ColE1 multimers to monomers and is necessary for plasmid stability. PurR dataset PurR dimers control several genes involved in purine and pyrimidine nucleotide biosynthesis and its own synthesis. This regulator requires the binding of two products of purine metabolism, hypoxanthine and guanine, to induce the conformational change that allows PurR to bind to DNA. TyrR dataset TyrR, the tyrosine repressor, is the dual transcriptional regulator of the TyrR regulon, which involves genes essential for aromatic amino acid biosynthesis and transport. TyrR can act both as a repressor and as an activator of transcription at σ70-dependent promoters. IHF dataset IHF, Integration Host Factor, is a global regulatory protein that helps maintain DNA architecture. It binds to and bends DNA at specific sites. IHF plays a role in DNA supercoiling and DNA duplex destabilization and affects processes such as DNA replication, recombination, and the expression of many genes. These sequence sets were chosen because of their highly degenerate binding sites. C.
Protein sequences Sequences from protein families in the PFAM and Prosite databases with known motifs are used to evaluate the performance of our method. The DUF356, Strep-H-triad, and Excalibur (extracellular calcium-binding) protein families have unambiguous motifs. Those from the Nup-retrotrp and Flagellin-C families have weak signals. Sequences from these families are also used to evaluate the capability of detecting subtle motifs. In addition to these families, sequences from the Xin Repeat and Planctomycete Cytochrome C protein families are used because they have more variable motifs. For gapped motif examples, we used sequences from the zf-C2H2, EGF2, LIG_CYCLIN_1, LIG_PCNA and MOD_TYR_ITAM families. To test the ability of de novo motif prediction, we used sequences from the ZZ, Myb and SWIRM domains. Because the motifs of these domains are unknown, the results from our method are compared with those of other motif-finding algorithms, and we expect that our methods will agree with the results of others, as shown in Chapter 4. 3. Fuzzy cluster analysis using FCM 3.1 FCM algorithm The FCM algorithm, initially developed by Dunn [121] and generalized later by Bezdek [92] with the introduction of the fuzzifier m, is the best-known method for fuzzy cluster analysis. FCM allows gradual memberships of data points to clusters, measured as degrees in [0,1]. This provides the flexibility to describe data points that belong to more than one cluster. In addition, these membership degrees provide a much finer picture of the data model; they afford not only an effective way of sharing a data point among multiple clusters, but also express how ambiguously or definitely the data point belongs to a specific cluster. Let $X = \{x_1, x_2, \dots, x_n\}$ be the dataset in a p-dimensional space $R^p$, where each $x_i = (x_{i1}, x_{i2}, \dots, x_{ip}) \in R^p$ is a feature vector or pattern vector, and each $x_{ij}$ is the j-th characteristic of the data point $x_i$.
Let c be a positive integer, $2 \le c < n$, and let $R^{cn}$ denote the vector space of all real $c \times n$ matrices. Let $V = \{v_1, v_2, \dots, v_c\} \in R^{cp}$, of which $v_k \in R^p$ is the center or prototype of the k-th cluster, $1 \le k \le c$. The concept of membership degrees is substantiated by the definition and interpretation of fuzzy sets. Fuzzy clustering, therefore, allows for a fine-grained solution space in the form of fuzzy partitions of the dataset. Each cluster $v_k$ in the cluster partition is represented by a fuzzy set $\mu_k$ as follows: complying with fuzzy set theory, the cluster assignment $u_{ki}$ is now the membership degree of the data point $x_i$ to cluster $v_k$, such that $u_{ki} = \mu_k(x_i) \in [0,1]$. Because memberships in clusters are fuzzy, there is not a single cluster label assigned to each data point. Instead, each data point $x_i$ is associated with a fuzzy label vector that states its memberships in the c clusters, $u_{\cdot i} = (u_{1i}, u_{2i}, \dots, u_{ci})^T$. (3.1) Definition 3.1 (fuzzy c-partition): Any matrix $U = \{u_{ki}\} \in M_{fcn}$ defines a fuzzy c-partition (or fuzzy partition), where $M_{fcn} = \left\{U \in R^{cn} \mid u_{ki} \in [0,1]\ \forall i,k;\ \sum_{k=1}^{c} u_{ki} = 1\ \forall i;\ 0 < \sum_{i=1}^{n} u_{ki} < n\ \forall k\right\}$. (3.2) Row k of a matrix $U \in M_{fcn}$ exhibits the k-th membership function $\mu_k$ in the fuzzy partition matrix U. The definition of a fuzzy partition matrix (3.2) restricts the assignment of data to clusters and what membership degrees are allowed. Definition 3.2 (fuzzy c-means functionals): Let $J_m: M_{fcn} \times R^{cp} \to R^+$ be $J_m(U, V) = \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ki}^{m}\, d^2(x_i, v_k) \to \min$, (3.3) where m is the fuzzifier factor, $1 < m < \infty$, also called the weighting exponent, which controls the degree of fuzziness of the clustering; $J_m(U, V)$ is an infinite family of objective functions; and d(.) is any inner product norm of $R^p$: $d^2(x_i, v_k) = \|x_i - v_k\|_A^2 = (x_i - v_k)^T A (x_i - v_k)$, (3.4) where A is a positive definite matrix. If $A = I_{p \times p}$ then $d^2(x, y) = \|x - y\|^2$.
Theorem 3.1 (prototypes of FCM): Let $m$ and $c$ be fixed parameters, and let $I_i$, $\bar{I}_i$ be the sets defined, for each $i$, $1 \le i \le n$, as

$I_i = \{k \mid 1 \le k \le c;\ d(x_i, v_k) = 0\}, \qquad \bar{I}_i = \{1, 2, \ldots, c\} \setminus I_i. \qquad (3.5)$

Then $(U,V) \in (M_{fcn} \times R^{cp})$ may be a global minimum of $J_m(U,V)$ with respect to (w.r.t.) the restriction

$\sum_{k=1}^{c} u_{ki} = 1, \quad 1 \le i \le n, \qquad (3.6)$

as in (3.2), only if

$u_{ki} = \begin{cases} \left[ \sum_{l=1}^{c} \left( \dfrac{d^2(x_i, v_k)}{d^2(x_i, v_l)} \right)^{\frac{1}{m-1}} \right]^{-1}, & I_i = \emptyset \\[2mm] 0, \ k \in \bar{I}_i; \quad \sum_{k \in I_i} u_{ki} = 1, & I_i \ne \emptyset \end{cases} \qquad 1 \le i \le n,\ 1 \le k \le c, \qquad (3.7)$

and

$v_k = \sum_{i=1}^{n} u_{ki}^{m}\, x_i \Big/ \sum_{i=1}^{n} u_{ki}^{m}, \quad 1 \le k \le c. \qquad (3.8)$

Proof. We consider minimizing $J_m$ w.r.t. $U$ under the restriction (3.6) using Lagrange multipliers. Let the multipliers be $\lambda_i$, $i = 1, \ldots, n$, and put

$L = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{m}\, d^2(x_i, v_k) - \sum_{i=1}^{n} \lambda_i \left( \sum_{k=1}^{c} u_{ki} - 1 \right).$

As the necessary optimality conditions,

$\frac{\partial L}{\partial u_{ki}} = m\, u_{ki}^{m-1}\, d^2(x_i, v_k) - \lambda_i = 0, \qquad (3.9)$

$\frac{\partial L}{\partial v_k} = -2 \sum_{i=1}^{n} u_{ki}^{m}\, A\,(x_i - v_k) = 0. \qquad (3.10)$

Assume that $k \notin I_i$; then $d(x_i, v_k) \ne 0$, and we derive from Equation (3.9) for each $k$:

$u_{ki} = \left( \frac{\lambda_i}{m} \right)^{\frac{1}{m-1}} \left( \frac{1}{d^2(x_i, v_k)} \right)^{\frac{1}{m-1}}, \quad 1 \le k \le c. \qquad (3.11)$

Summing over $k = 1, \ldots, c$ and taking (3.6) into account, we have

$\sum_{k=1}^{c} u_{ki} = \left( \frac{\lambda_i}{m} \right)^{\frac{1}{m-1}} \sum_{k=1}^{c} \left( \frac{1}{d^2(x_i, v_k)} \right)^{\frac{1}{m-1}} = 1.$

Hence,

$\left( \frac{\lambda_i}{m} \right)^{\frac{1}{m-1}} = \left[ \sum_{k=1}^{c} \left( \frac{1}{d^2(x_i, v_k)} \right)^{\frac{1}{m-1}} \right]^{-1}.$

Together with (3.11), we obtain the membership

$u_{ki} = \left( \frac{1}{d^2(x_i, v_k)} \right)^{\frac{1}{m-1}} \Bigg/ \sum_{l=1}^{c} \left( \frac{1}{d^2(x_i, v_l)} \right)^{\frac{1}{m-1}} = \left[ \sum_{l=1}^{c} \left( \frac{d^2(x_i, v_k)}{d^2(x_i, v_l)} \right)^{\frac{1}{m-1}} \right]^{-1}.$

Similarly, we derive from (3.10) for each $k$, $1 \le k \le c$:

$v_k = \sum_{i=1}^{n} u_{ki}^{m}\, x_i \Big/ \sum_{i=1}^{n} u_{ki}^{m}.$ Q.E.D.

3.2 FCM convergence

The iteration process of the FCM algorithm can be described by a map $\mathcal{T}_m: (U, V) \to (U', V')$, where $U' = F(V)$ and $V' = G(U)$. A sequence $(U^{(t)}, V^{(t)})$, $t = 1, 2, \ldots$, is called an iteration sequence of the FCM algorithm if $(U^{(t)}, V^{(t)}) = \mathcal{T}_m(U^{(t-1)}, V^{(t-1)})$, $t \ge 1$, where $(U^{(0)}, V^{(0)})$ is any element of $M_{fcn} \times R^{cp}$.
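Concretely, the necessary conditions (3.7) and (3.8) yield the familiar alternating-optimization loop. The sketch below (with our own function and variable names, not part of the thesis) assumes the Euclidean norm, i.e. $A = I$ in (3.4), and guards the singular case $k \in I_i$ numerically:

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means: alternate the membership update (3.7)
    and the prototype update (3.8) until the partition stabilizes."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # enforce sum_k u_ki = 1
    for _ in range(max_iter):
        Um = U ** m
        # prototype update (3.8): membership-weighted means of the data
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # squared Euclidean distances d^2(x_i, v_k), shape (c, n)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)             # guard the d = 0 (k in I_i) case
        # membership update (3.7): u_ki proportional to d2 ** (-1/(m-1))
        w = d2 ** (-1.0 / (m - 1.0))
        U_new = w / w.sum(axis=0)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```

For well-separated data the loop converges in a handful of iterations, and the columns of the returned $U$ each sum to one, as (3.6) requires.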
Define the solution set

$\Omega = \left\{ (U^*, V^*) \in M_{fcn} \times R^{cp} \;\middle|\; J_m(U^*, V^*) \le J_m(U, V^*)\ \forall U \in M_{fcn},\ U \ne U^*;\ \ J_m(U^*, V^*) < J_m(U^*, V)\ \forall V \in R^{cp},\ V \ne V^* \right\}.$

Theorem 3.2 (descent function, Bezdek [133]): $J_m$ is a descent function for $\{\mathcal{T}_m, \Omega\}$.

Proof: Because $y \mapsto d^2(y)$ and $y \mapsto y^m$ are continuous and $J_m$ is a sum of products of such functions, $J_m$ is continuous on $M_{fcn} \times R^{cp}$. Furthermore,

$J_m(\mathcal{T}_m(U,V)) = J_m(F(G(U)), G(U)) \le J_m(U, G(U))$, by the first case in the definition of $\Omega$,
$\le J_m(U, V)$, by the second case in the definition of $\Omega$.

Thus, $J_m$ is a descent function for $\{\mathcal{T}_m, \Omega\}$. Q.E.D.

Theorem 3.3 (solution convex set): Let $[\mathrm{conv}(X)]^c$ be the c-fold Cartesian product of the convex hull of $X$, and let $(U^0, V^0)$ be the starting point of the $\mathcal{T}_m$ iteration sequence, with $U^0 \in M_{fcn}$ and $V^0 = G(U^0)$. Then $\mathcal{T}_m(U^{(t-1)}, V^{(t-1)}) \in M_{fcn} \times [\mathrm{conv}(X)]^c$, $t = 1, 2, \ldots$, and $M_{fcn} \times [\mathrm{conv}(X)]^c$ is compact in $M_{fcn} \times R^{cp}$.

Proof: Let $U^0 \in M_{fcn}$ be chosen. For each $k$, $1 \le k \le c$, we have

$v_k^0 = \sum_{i=1}^{n} (u_{ki}^0)^m\, x_i \Big/ \sum_{i=1}^{n} (u_{ki}^0)^m.$

Let $\rho_{ki} = (u_{ki}^0)^m \big/ \sum_{l=1}^{n} (u_{kl}^0)^m$; then $v_k^0 = \sum_{i=1}^{n} \rho_{ki} x_i$ with $\sum_{i=1}^{n} \rho_{ki} = 1$. Thus $v_k^0 \in \mathrm{conv}(X)$, and therefore $V^0 \in [\mathrm{conv}(X)]^c$. Subsequently, $U^{(t)} \in M_{fcn}$ and $V^{(t)} \in [\mathrm{conv}(X)]^c$ for all $t \ge 1$. Q.E.D.

Theorem 3.4 (the convergence of FCM, Bezdek et al. [134]): Let $\Theta^0 = (U^0, V^0)$ be the starting point of the iterations with $\mathcal{T}_m$. The sequence $\mathcal{T}_m^t(U^0, V^0)$, $t = 1, 2, \ldots$, either terminates at an optimal point $\Theta^* = (U^*, V^*) \in \Omega$, or contains a subsequence converging to a point in $\Omega$.

Proof: $J_m$ is a continuous descent function for $\mathcal{T}_m$ (Theorem 3.2), and the iteration sequences always lie in a compact subset of the domain of $J_m$ (Theorem 3.3). Thus, the iteration either terminates at an optimal point $\Theta^* = (U^*, V^*) \in \Omega$ or has a subsequence converging to a point in $\Omega$. Q.E.D.

3.3 Distance measures

Similar to other clustering algorithms, FCM relies on the distance, or dissimilarity, between data points, each described by a set of attributes and denoted as a multidimensional vector.
The attributes can be quantitative or qualitative, continuous or discrete, which leads to different measurement mechanisms. A dataset in the form of an object data matrix is designated as two-mode, because its row and column indices have different meanings. This is in contrast to the one-mode distance matrix, which is symmetric, with elements representing the distance or dissimilarity of a pair of data points, because both of its dimensions share the same meaning.

Given a dataset X, a distance function d(.) on X is defined to satisfy the following conditions:

Symmetry: $d(x,y) = d(y,x)$,
Positivity: $d(x,y) \ge 0$, for all $x, y \in X$.

If, in addition, the following conditions hold, then d is a metric:

Triangle inequality: $d(x,z) \le d(x,y) + d(y,z)$, for all $x, y, z \in X$,
Reflexivity: $d(x,y) = 0$ if and only if $x = y$.

If the triangle inequality is violated, then d is a semi-metric. Many different distance measures have been used in cluster analysis.

Euclidean distance

The Euclidean distance is the most widely used distance measure, defined as

$d^2(x,y) = \sum_{i=1}^{p} (x_i - y_i)^2. \qquad (3.12)$

This distance is a true metric because it satisfies the triangle inequality. In Equation (3.12), the expression values $x_i$ and $y_i$ are subtracted directly from each other. We therefore need to ensure that the expression data are properly normalized when using the Euclidean distance, for example by converting the measured gene expression levels to log-ratios.

Pearson correlation distance

The Pearson correlation distance is based on the Pearson correlation coefficient (PCC), which describes the similarity of two objects as

$\mathrm{PCC}(x,y) = \frac{1}{p} \sum_{i=1}^{p} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right),$

where $\sigma_x$ and $\sigma_y$ are the sample standard deviations of x and y, respectively. PCC takes values from -1 to 1: PCC = 1 when x and y are identical, PCC = 0 when they are unrelated, and PCC = -1 when they are anti-correlated. The Pearson correlation distance is then defined as

$d_p(x,y) = 1 - \mathrm{PCC}(x,y).$
(3.13)

Figure 3-1: Expression levels of three genes in five experimental conditions, under the Euclidean and correlation measures

The value of $d_p$ lies in $[0,2]$: $d_p = 0$ implies that x and y are identical, and $d_p = 2$ implies that they are maximally different. While the Euclidean measure takes the magnitude of the data into account, the Pearson correlation measure geometrically captures the pattern of the two data points [1]. It is therefore widely used for gene expression data. Even if the expression levels of two genes differ, if they peak similarly across the experiments, then they have a high correlation coefficient, i.e., a small Pearson correlation distance (Figure 3-1). By its definition, the Pearson correlation distance satisfies the distance conditions, but it is not a metric, since it violates the triangle inequality.

Absolute Pearson correlation distance

This distance is defined as

$d_{ap}(x,y) = 1 - |\mathrm{PCC}(x,y)|. \qquad (3.14)$

Because the absolute value of PCC falls in the range $[0,1]$, the $d_{ap}$ distance also falls in $[0,1]$. $d_{ap}$ equals 0 if the expression profiles of the two genes have the same shape, i.e., are either exactly the same or exactly opposite. Because anti-correlated profiles are thus treated as identical, $d_{ap}$ (3.14) should be used with care.

Uncentered Pearson correlation distance

This distance is based on the uncentered Pearson correlation coefficient (UPCC), defined as

$\mathrm{UPCC}(x,y) = \frac{1}{p} \sum_{i=1}^{p} \frac{x_i\, y_i}{\sigma_x^0\, \sigma_y^0}, \quad \text{where } \sigma_z^0 = \sqrt{\frac{1}{p}\sum_{i=1}^{p} z_i^2}.$

The uncentered Pearson correlation distance is defined as

$d_{up}(x,y) = 1 - \mathrm{UPCC}(x,y). \qquad (3.15)$

The $d_{up}$ distance in Equation (3.15) may be appropriate when there is a zero reference state. For instance, for gene expression data given as log-ratios, a log-ratio equal to 0 corresponds to the green and red signals being equal, which means that the experimental manipulation did not affect the expression. Because UPCC lies in the range $[-1,1]$, the $d_{up}$ distance falls in $[0,2]$.
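For illustration, the centered distance (3.13) and the uncentered distance (3.15) can be computed as follows; this is a minimal sketch with our own function names, assuming the population (1/p) normalization used in the definitions above:

```python
import numpy as np

def pearson_dist(x, y):
    """d_p = 1 - PCC, eq. (3.13): captures the pattern, ignores magnitude."""
    xc, yc = x - x.mean(), y - y.mean()
    pcc = (xc * yc).mean() / (x.std() * y.std())   # np.std is the population sd
    return 1.0 - pcc

def uncentered_pearson_dist(x, y):
    """d_up = 1 - UPCC, eq. (3.15): correlation about a zero reference state."""
    sx = np.sqrt((x ** 2).mean())
    sy = np.sqrt((y ** 2).mean())
    upcc = (x * y).mean() / (sx * sy)
    return 1.0 - upcc
```

Two profiles with the same shape but different magnitude, e.g. y = 10x + 3, give a Pearson distance of 0, while their Euclidean distance would be large; anti-correlated profiles give the maximum value 2.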
Absolute uncentered Pearson correlation distance

This distance is defined as

$d_{aup}(x,y) = 1 - |\mathrm{UPCC}(x,y)|. \qquad (3.16)$

Because the absolute value of UPCC falls in $[0,1]$, the $d_{aup}$ distance also falls in $[0,1]$.

Comparison of measures

The distance measure plays an important role in obtaining correct clusters. For simple multidimensional datasets, the Euclidean distance is usually employed. But as the number of dimensions grows, with each dimension describing a different aspect of the data, the Euclidean distance may no longer be the best choice.

For the Iris dataset, we ran the FCM algorithm with the number of clusters set to 3, the number of classes of Iris flowers in the dataset: Setosa, Versicolor, and Virginica. Using the five distance measures (3.12) to (3.16), we obtained five different cluster partitions. For each partition, we compared the cluster label of each data point with its class label to compute the correctness ratio for every class. The results are shown in Table 3.1. While the four distance measures based on Pearson's correlation coefficient provide correct results, the Euclidean distance performs worse, particularly on the Virginica class. In this comparison, the absolute Pearson correlation measure and the Pearson correlation measure give identical results; similarly, the result of the uncentered Pearson correlation measure is identical to that of the absolute uncentered Pearson correlation measure.

Table 3.1: Performance of different distance measures on the Iris dataset (classification correctness, %)

Distance measure             | Setosa | Versicolor | Virginica | Average
Euclidean                    | 100.00 |   94.00    |   76.00   |  90.00
Pearson correlation          | 100.00 |   94.00    |   94.00   |  96.00
Abs. Pearson correlation     | 100.00 |   94.00    |   94.00   |  96.00
Uncentered Pearson cor.      | 100.00 |   92.00    |  100.00   |  97.33
Abs. uncentered Pearson cor. | 100.00 |   92.00    |  100.00   |  97.33

Table 3.2: Performance of different distance measures on the Wine dataset (classification correctness, %)

Distance measure             |  #1   |  #2   |   #3   | Average
Euclidean                    | 76.27 | 70.42 |  56.25 |  67.65
Pearson correlation          | 91.53 | 85.92 | 100.00 |  92.48
Abs. Pearson correlation     | 72.88 | 67.61 |  95.83 |  78.77
Uncentered Pearson cor.      | 91.53 | 85.92 |  97.92 |  91.79
Abs. uncentered Pearson cor. | 91.53 | 85.92 |  97.92 |  91.79

We ran a similar benchmark on the Wine dataset, which contains data objects of three different classes of wines. Results are shown in Table 3.2. Again, all Pearson-correlation-based distance measures performed better than the Euclidean distance. The Pearson correlation measures thus appear preferable for real, unscaled datasets.

3.4 Partially missing data analysis

In cluster analysis, complete information is preferred throughout the experiment. Unfortunately, real-world datasets frequently have missing values, caused, for example, by errors that produce incomplete attributes or by random noise; sensor failures in a control system may cause the system to miss information. For gene expression data, missing values can arise at the platform level and at the meta-analysis level. At the platform level, the reasons include insufficient resolution, image corruption, spotting, scratches or dust on the slide, or hybridization failure. At the meta-analysis level, missing values can be caused by differences in platforms or chip generations, and by the lack of an appropriate method to map probes onto genes across different microarray platforms.

In a dataset with partially missing data, some attribute values of a data point $x_i$ may not be observed. For example, $x_i = (x_{i1}, x_{i2}, ?, ?, x_{i5}, ?)$ has missing values for the third, fourth and sixth attributes; only the first, second and fifth attributes are observed.
This causes a problem in measuring distances between $x_i$ and other data points, and the clustering algorithm therefore cannot operate properly at $x_i$. Let

$X_W = \{x_i \in X \mid x_i \text{ is a complete data point}\}$, $\quad X_P = \{x_{ij} \mid x_{ij} \ne\ ?\}$, $\quad X_M = \{x_{ij} \mid x_{ij} =\ ?\}$.

Clearly, $X_P$ contains more information than $X_W$ [140].

Three approaches [137-139] have been widely used to address the problem of missing values. (1) The ignorance-based approach is the most trivial; it simply discards incomplete data points and is viable when the proportion of incomplete data is small, $|X \setminus X_W| \ll |X|$, but the elimination entails a loss of information. (2) The model-based approach defines a model for the partially missing data $X_M$ using the available information $X_P$ and applies the model to build a complete dataset; however, its complexity can prevent application to large datasets. (3) The imputation-based approach fills in the missing values $X_M$ by some means of approximation.

Of the three, the imputation-based approach is preferred. Because it requires an approximation model for the missing data, it is integrated with the clustering algorithm and operates on the cluster model, probability-based or possibility-based, estimated by that algorithm. For imputation with FCM, Hathaway et al. [140] identified four strategies for handling missing values.

Whole data strategy (WDS)

This strategy, like the ignorance-based approach, eliminates the incomplete data points. Results therefore represent a cluster partition of $X_W$ instead of $X$. This strategy is applicable when $|X_P| / |X| > 0.75$.

Partial distance strategy (PDS)

This is the strategy recommended when $X_M$ is large. The distance between every pair of data points is computed using only the observed values $X_P$:

$d_{I_i}^2(x_i, v_k) = \|x_i - v_k\|_{A I_i}^2 = (x_i - v_k)^T (A\, I_i)(x_i - v_k), \qquad (3.17)$

where $I_i$ is the indicator function of the data point $x_i$, defined as in (3.18).
$I_{it} = \begin{cases} 0, & x_{it} \in X_M \\ 1, & x_{it} \in X_P \end{cases} \qquad \text{for } 1 \le i \le n,\ 1 \le t \le p. \qquad (3.18)$

Optimal completion strategy (OCS)

In addition to the model parameters, this strategy estimates $X_M$. The FCM algorithm is modified accordingly: in the first iteration, $X_W$ is used instead of $X$, and $X_M$ is then estimated so as to optimize the $J_m$ function using (3.19); FCM is subsequently run on the entire dataset $X$ until convergence:

$x_{ij} = \sum_{k=1}^{c} u_{ki}^{m}\, v_{kj} \Big/ \sum_{k=1}^{c} u_{ki}^{m}, \qquad x_{ij} \in X_M. \qquad (3.19)$

Nearest prototype strategy (NPS)

The nearest prototype strategy is in fact a version of the k-nearest neighbors strategy [89] for fuzzy partitions, and may be considered a simplified version of OCS. In each iteration, $X_M$ is updated as

$x_{ij} = v_{kj}, \quad \text{where } x_{ij} \in X_M \text{ and } d_{I_i}(x_i, v_k) = \min_{1 \le l \le c} d_{I_i}(x_i, v_l). \qquad (3.20)$

Algorithm performance measures

Hathaway et al. [140] proposed two measures for evaluating the performance of imputation algorithms. The misclassification measure (MISC) is computed from the assigned cluster label and the actual class label of the data objects:

$\mathrm{MISC}(X,V) = \sum_{i=1}^{n} I_m(C(x_i), V(x_i)), \qquad (3.21)$

where $C(x_i)$ and $V(x_i)$ are the actual class and the assigned cluster labels of the data object $x_i$, respectively, and $I_m$ is the index function

$I_m(x,y) = \begin{cases} 1, & x \ne y \\ 0, & x = y. \end{cases}$

The cluster prototype error (PERR) measures the difference between the predicted and the actual cluster prototypes of the dataset:

$\mathrm{PERR}(V) = \left( \sum_{k=1}^{c} d^2(v_k, v_k^*) \right)^{1/2}, \qquad (3.22)$

where $v_k$ is the $k$-th predicted cluster prototype and $v_k^*$ the $k$-th actual cluster prototype, $k = 1, \ldots, c$.

We applied the four strategies discussed above to the Iris dataset [140] at different scales of missing values $X_M$. The results are shown in Table 3.3. The dataset contains $150 \times 4 = 600$ attribute values. $X_M$ was generated randomly at scales from 0 to 0.75. Each method was run 50 times.
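As a concrete illustration of (3.17)-(3.18), the following sketch computes a partial squared distance with missing attributes encoded as NaN (our convention; the thesis writes them as "?"). The optional $p / \sum_t I_{it}$ rescaling of Hathaway et al. [140] is included so that distances computed over few observed attributes remain comparable to complete ones:

```python
import numpy as np

def partial_distance_sq(x, v, scale=True):
    """Squared partial distance (3.17): sum over observed attributes only.

    Missing attributes of x are NaN. With scale=True, the p / sum(I) factor
    used by Hathaway et al. [140] is applied."""
    I = ~np.isnan(x)                       # indicator function (3.18)
    if not I.any():
        raise ValueError("data point has no observed attributes")
    d2 = ((x[I] - v[I]) ** 2).sum()        # (x - v)^T I (x - v) with A = identity
    if scale:
        d2 *= x.size / I.sum()
    return d2
```

For x = (1, ?, 3, ?) and v = (0, 0, 0, 0), the unscaled partial distance is 1 + 9 = 10, and the scaled version multiplies by 4/2 to compensate for the two unobserved attributes.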
In Table 3.3, columns 2-5 show the average number of iterations to convergence, columns 6-9 the misclassification counts, and columns 10-13 the mean prototype errors, computed from the distances between the estimated and the true prototypes of the dataset, for the four methods. Based on the errors in columns 10-13, the accuracy of prototype estimation by OCS was always at least as good as that of the other approaches. WDS performed well on datasets with low percentages of missing values. The second best method was NPS.

Table 3.3: Average results of 50 trials using incomplete Iris data [140]

% Missing | Mean iterations to termination  | Mean misclassifications         | Mean prototype error
          | WDS    PDS    OCS    NPS        | WDS    PDS    OCS    NPS        | WDS   PDS   OCS   NPS
 0        |  1      1      1      1         | 16     16     16     16         | 0.00  0.00  0.00  0.00
10        | 18.18  15.64  22.20  16.00      | 16.04  16.24  16.28  16.28      | 0.17  0.07  0.07  0.07
20        | 20.74  17.24  22.64  18.12      | 16.42  16.62  16.48  16.54      | 0.13  0.12  0.11  0.12
30        | 22.68  17.84  29.90  20.42      | 19.00  17.04  16.72  16.98      | 0.49  0.15  0.14  0.15
40        | 23.62  18.48  29.02  22.40      | 22.46  18.76  18.82  18.84      | 0.78  0.18  0.17  0.18
50        | 14.80  20.14  33.06  29.08      | 42.64  21.44  21.14  21.36      | 2.11  0.25  0.21  0.24
75        |  N/A   47.20  68.92  70.78      |  N/A   33.44  29.12  30.54      |  N/A  0.74  0.39  0.52

3.5 FCM clustering solution validation

A fundamental, and the most difficult, problem in cluster analysis is cluster validation: determining whether a fuzzy partition is of good quality. Cluster validation addresses the following questions:

i. If all existing algorithms are applied to the dataset, each yielding a different partition, which assignment is correct?
ii. If the number of clusters is unknown, partitions can be determined for different numbers of clusters. Which partition (or number of clusters) is correct?
iii. Most algorithms assume a certain structure to determine a partition, without testing whether this structure really exists in the data.
Does the result of a given clustering technique justify its application? The cluster validation problem is the general problem of determining whether the underlying assumptions of a clustering algorithm (cluster shapes, number of clusters, and so on) are satisfied for the dataset at hand. There are three different approaches to cluster validation:

a) using hypothesis tests on the clusters,
b) applying model selection techniques to the cluster model,
c) using cluster validity measures.

The first two methods rely on statistical tests under an assumed probabilistic model of the data distribution, so validation is performed relative to the assumed distribution. The hypothesis-testing approach applies to a single model and corresponds to what we call local validation (Section 3.5.2), while the model selection approach, similar to global validation (Section 3.5.1), considers a family of probabilistic models corresponding to a set of partitions from which the best is to be selected.

FCM is a possibility-based approach. It is similar to the EM algorithm with a mixture model, which is a probability-based approach. Compared with FCM, EM, with its standard assumptions about the data distribution, has the disadvantage of slow convergence; on the other hand, because of its explicit statistical model, EM can use the first two validation approaches, which are not applicable to FCM. Most methods for validating fuzzy partitions are therefore based on the third approach, cluster validity measures. Lacking certain knowledge about the data, one must determine for each problem individually which cluster shape is appropriate, what distinguishes good clusters from bad ones, and whether some of the data lack any structure at all. There are two common approaches to creating cluster validity measures.
3.5.1 Global validity measures

Global validity measures are mappings $g: A(D,R) \to R$ that describe the quality of a complete cluster partition by a single real value. The objective function $J_m$ in (3.3) is a simple example of such a measure: when we determine the cluster parameters and memberships, we minimize $J_m$. However, the objective function is a poor choice for a validity function; obviously $\lim_{c \to n} J_m = 0$, yet $c = n$ is not the optimal number of clusters for every dataset. In the recent literature, a number of cluster validity measures have been developed to address this problem in specific situations [92-97, 121-131].

Definition 3.3 (partition coefficient - PC, Bezdek): Let $A(R^p, R^{cp})$ be an analysis space, $X \subset R^p$, and $U: X \to V \in R^{cp}$ a fuzzy cluster partition. The partition coefficient PC of the partition U is defined by

$V_{PC} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^2. \qquad (3.23)$

The inequality $\frac{1}{c} \le V_{PC} \le 1$ holds because of (3.2) and (3.6). $V_{PC} = 1$ for all hard partitions. If the fuzzy partition contains no information, each object is assigned to every cluster with the same degree, and the minimum value $1/c$ of $V_{PC}$ is obtained. The maximum value of $V_{PC}$ should be sought when selecting among multiple fuzzy partitions, i.e., $V_{PC}(U_{opt}) = \max_{1 \le k \le c} V_{PC}(U_k)$. The optimal partition $U_{opt}$ is the most "unambiguous" one with respect to the cluster assignment of the objects.

Definition 3.4 (partition entropy - PE, Bezdek): Let $A(R^p, R^{cp})$ be an analysis space, $X \subset R^p$ and $U: X \to V \in R^{cp}$ a fuzzy partition. The partition entropy PE of the partition U is defined by

$V_{PE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki} \ln(u_{ki}). \qquad (3.24)$

Similar to the partition coefficient, we can show that $0 \le V_{PE} \le \ln(c)$ holds for any fuzzy partition $U: X \to V$. When the fuzzy memberships are uniformly distributed, $V_{PE}$ tends towards $\ln(c)$; the maximum value of $V_{PE}$ is obtained for the most ambiguous fuzzy partition.
The entropy, i.e., the mean information content of a source describing a correct partition, is 0, which is also the value obtained when U is a hard partition. To find the best fuzzy partition, we therefore look for the partition with minimum entropy, i.e., $V_{PE}(U_{opt}) = \min_{1 \le k \le c} V_{PE}(U_k)$.

Both PC and PE suffer from over-fitting when estimating the number of clusters c, because using all the memberships in the validity measure does not necessarily increase the dependence on this parameter. This is why the PC and PE indices tend to improve for large values of c and provide reliable decisions only on unambiguous partitions.

For a partition to be good, its clusters should satisfy the compactness and separateness conditions: a partition contains compact and separated clusters if any two objects from the same cluster are closer to each other than any two objects from different clusters.

Fukuyama and Sugeno [93] first proposed a simple approach that measures cluster separation using the geometry of the cluster centers: the more the clusters scatter away from the central point of the data, the more the corresponding solution is preferred.

Definition 3.5 (Fukuyama-Sugeno - FS cluster index): Let $(R^p, R^{cp})$ be an analysis space, $X \subset R^p$ and $U: X \to V \in R^{cp}$ a fuzzy partition. The Fukuyama-Sugeno cluster index of the partition U is defined by

$V_{FS} = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ki}^{m} \left[ d^2(x_i, v_k) - d^2(v_k, \bar{v}) \right], \qquad (3.25)$

where $\bar{v} = \sum_{k=1}^{c} v_k / c$. For each cluster $v_k$, $1 \le k \le c$, the inequality $d^2(x_i, v_k) < d^2(\bar{v}, v_k)$ holds for every data object $x_i$, $1 \le i \le n$; we therefore have $V_{FS} \le 0$, and $V_{FS} = 0$ when there is a single cluster in the fuzzy partition. The farther the clusters lie from the central prototype $\bar{v}$, the more separated they are. To find the best fuzzy partition, we therefore look for the partition with minimum $V_{FS}$. There are still cases, however, where some clusters are far from $\bar{v}$ but close to each other.
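Once a fuzzy partition (U, V) is available, the global indices above are inexpensive to evaluate. The following is a sketch of PC (3.23), PE (3.24) and FS (3.25), with our own function names, Euclidean distance, and U stored as a c x n array:

```python
import numpy as np

def partition_coefficient(U):
    """V_PC (3.23): ranges from 1/c (totally fuzzy) to 1 (hard); maximize."""
    return (U ** 2).sum() / U.shape[1]

def partition_entropy(U, eps=1e-12):
    """V_PE (3.24): ranges from 0 (hard) to ln(c) (totally fuzzy); minimize."""
    return -(U * np.log(U + eps)).sum() / U.shape[1]

def fukuyama_sugeno(X, U, V, m=2.0):
    """V_FS (3.25): compactness minus separation from the grand prototype; minimize."""
    vbar = V.mean(axis=0)
    d2_xv = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # d^2(x_i, v_k)
    d2_vv = ((V - vbar) ** 2).sum(axis=1)[:, None]              # d^2(v_k, vbar)
    return ((U ** m) * (d2_xv - d2_vv)).sum()
```

A hard two-cluster partition gives V_PC = 1 and V_PE = 0, while a maximally ambiguous partition gives V_PC = 1/c and V_PE = ln(c), matching the bounds stated above.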
Dunn [123] proposed a cluster index that measures both the compactness and the separation of a partition; this index was therefore named the compact and separated cluster index. However, Dunn's index is applicable only to hard partitions. To address this, Xie and Beni [94] proposed a cluster index with a separation factor that takes the distances between clusters into account.

Definition 3.6 (Xie and Beni - XB cluster index): Let $(R^p, R^{cp})$ be an analysis space, $X \subset R^p$ and $U: X \to V \in R^{cp}$ a fuzzy partition. The Xie-Beni cluster index of the partition U is defined by

$V_{XB} = \frac{\sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^2\, d^2(x_i, v_k)}{n \times \mathrm{Sep}(V)^2} = \frac{\sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^2\, d^2(x_i, v_k)}{n \times D_{\min}(V)^2}, \qquad (3.26)$

where $\mathrm{Sep}(V) = D_{\min}(V) = \min_{1 \le k,l \le c,\ k \ne l} \{d(v_k, v_l)\}$ is the minimum distance between the cluster centers of the partition. Since

$\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^2\, d^2(x_i, v_k) \le \max_{1 \le k \le c} \{\mathrm{diam}(v_k)^2\},$

we have $V_{XB} \le \dfrac{1}{DI^2}$, where DI is Dunn's compact and separated index,

$DI = \frac{D_{\min}(V)}{\max_{1 \le k \le c} \{\mathrm{diam}(v_k)\}}.$

Compared to DI, the $V_{XB}$ index is much simpler to compute, and it takes the membership degrees into account, so that a detour via a hard partition is not necessary [121]. However, the cluster indices PC, PE, FS and XB do not consider cluster shapes and volumes. Gath and Geva, starting from the concepts of hypervolume and density, proposed three cluster hypervolume measures, which form the basis of their cluster validity index and of the indices of many other researchers.

Definition 3.7 (fuzzy hypervolume - FHV): Let $(R^p, R^{cp})$ be an analysis space, $X \subset R^p$ and $U: X \to V \in R^{cp}$ a fuzzy partition. The fuzzy hypervolume of the partition U is defined by

$V_{FHV} = \sum_{k=1}^{c} \sqrt{\det(F_k)}, \qquad (3.27)$

where $F_k = \sum_{i=1}^{n} u_{ki} (x_i - v_k)(x_i - v_k)^T \Big/ \sum_{i=1}^{n} u_{ki}$ is the fuzzy covariance matrix of cluster $v_k$ [125]. A compact partition can be expected to have a small value of $V_{FHV}$.
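The XB ratio (3.26) is equally direct to evaluate; a minimal sketch follows (our naming, Euclidean distance assumed):

```python
import numpy as np

def xie_beni(X, U, V):
    """V_XB (3.26): within-cluster compactness over minimal prototype separation; minimize."""
    n = X.shape[0]
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # d^2(x_i, v_k), shape (c, n)
    compact = ((U ** 2) * d2).sum()
    # Sep(V)^2 = min_{k != l} d^2(v_k, v_l)
    vv = ((V[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    sep2 = vv[~np.eye(V.shape[0], dtype=bool)].min()
    return compact / (n * sep2)
```

Two tight, widely separated clusters drive the numerator toward zero and the denominator up, so a good partition yields a small V_XB, consistent with the minimization rule above.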
However, $V_{FHV}$ measures only the compactness of the partition. Rezaee et al. [95] extended the $V_{FHV}$ index by adding a term that measures the scattering between the clusters of the partition.

Definition 3.8 (compose within and between scattering index - CWB, Rezaee et al.): Let $(R^p, R^{cp})$ be an analysis space, $X \subset R^p$ and $U: X \to V \in R^{cp}$ a fuzzy partition. The compose within and between scattering index [95] of the partition U is defined by

$V_{CWB} = \alpha\, \mathrm{Scat}(V) + \mathrm{Dis}(V). \qquad (3.28)$

$\mathrm{Scat}(V)$ measures the average scattering of the c clusters in V and is defined as

$\mathrm{Scat}(V) = \frac{V_{FHV}}{c\, \|\sigma(X)\|}, \quad \text{where } \sigma(X) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \bar{x}\|^2$

and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the center point of X. $\mathrm{Dis}(V)$ measures the scattering between the c clusters of V and is defined as

$\mathrm{Dis}(V) = \frac{D_{\max}}{D_{\min}} \sum_{k=1}^{c} \left( \sum_{l=1}^{c} \|v_k - v_l\| \right)^{-1},$

where $D_{\max}$ and $D_{\min}$ are the maximum and minimum distances between cluster centers. The first term of $V_{CWB}$, Scat() in (3.28), indicates the average scattering variation within the c clusters of the partition; the smaller this term, the more compact the partition. This term, however, does not take into account any geometric assumptions about the cluster prototypes. The second term, Dis(), indicates the average scattering separation between clusters; it is influenced by the geometry of the cluster centers and increases as the number of clusters increases. For the best partition, we look for the partition with the minimum value of $V_{CWB}$.

Definition 3.9 (Pakhira, Bandyopadhyay and Maulik fuzzy cluster index - PBMF): Let $(R^p, R^{cp})$ be an analysis space, $X \subset R^p$ and $U: X \to V \in R^{cp}$ a fuzzy partition. The PBMF index [96] of the partition U is defined by

$V_{PBMF} = \left( \frac{1}{c} \times \frac{E_1}{J_1} \times D_{\max} \right)^2, \qquad (3.29)$

where $E_1 = n\,\sigma(X)$ is a constant for each dataset and $D_{\max} = \max_{1 \le k,l \le c} d(v_k, v_l)$. The term $J_1$ measures the compactness, while the term $D_{\max}$ measures the scattering separation of the clusters in the partition. A good partition must have either a small $J_1$ or a large $D_{\max}$.
To find the best partition, we therefore look for the partition with the maximum value of $V_{PBMF}$.

Definition 3.10 (Babak Rezaee - BR cluster index): Let $(R^p, R^{cp})$ be an analysis space, $X \subset R^p$ and $U: X \to V \in R^{cp}$ a fuzzy partition. The BR index [97] of the partition U is defined by

$V_{BR} = \frac{\mathrm{Sep}(V)}{\max\{\mathrm{Sep}(V)\}} + \frac{J_m}{\max\{J_m\}}, \qquad (3.30)$

where

$\mathrm{Sep}(V) = \frac{1}{c(c-1)} \sum_{k,l=1,\ k \ne l}^{c} C_{REL}(v_k, v_l).$

Instead of using the geometric distance between two centers as the distance between a pair of clusters, Rezaee [97] used their relative degree of sharing, or connectivity, $C_{REL}$, defined as in (3.31).

To compare the performance of the global cluster validity indices (3.23) to (3.30), we chose the three previously used real datasets, Iris, Wine, and Glass, and the artificial dataset ASET2, which contains 200 data points in five well-separated clusters (Figure 3-2).

Figure 3-2: ASET2 dataset with five well-separated clusters

For each of the four selected datasets, we ran FCM with the number of clusters set from 2 to 10; each partition, corresponding to one cluster number, was evaluated with the cluster validity measures. The results are shown separately for the individual validity measures in Figures 3-3 to 3-9.

Figure 3-3: PC index (maximize)    Figure 3-4: PE index (minimize)

The two indices PC (3.23) and PE (3.24) behave very similarly: when PC decreases (Figure 3-3), PE (Figure 3-4) increases, and vice versa. These two measures, however, prefer less ambiguous partitions: they correctly identified the number of clusters in the artificial dataset ASET2, but failed on the real datasets, where the clusters overlap significantly.
Figure 3-5: FS index (minimize)    Figure 3-6: XB index (minimize)

Figure 3-7: CWB index (minimize)    Figure 3-8: PBMF index (maximize)

Figure 3-9: BR index (minimize)

The other validity measures also correctly identified the cluster number in the ASET2 dataset. However, with the exception of PBMF on the Iris dataset (Figure 3-8), all failed to detect the correct cluster number in the real datasets. For further comparison of these validity measures, we used 84 artificial datasets [147] generated with a finite mixture model as in Section 2.5.1. The datasets are distinguished by their dimension and cluster number; with dimensions from 2 to 3 and cluster numbers from 3 to 9, we generated $(3-2+1) \times (9-3+1) = 14$ dataset types, and for each type we generated 6 datasets, for a total of $6 \times 14 = 84$. For each artificial dataset, we ran the standard FCM five times with m set to 2.0. In each case, the best fuzzy partition was selected, and the validity measures were applied to search for the optimal number of clusters between 2 and 12, to be compared with the known number of clusters. We repeated the experiment 20 times and averaged the performance of each method. The results of these comparisons are shown in Figure 3-10.

Figure 3-10: Performance of the global validity measures on artificial datasets with different numbers of clusters

The PC measure provides better results than the PE measure.
Measures using the geometry of the prototypes and geometric distances, i.e., PBMF, CWB and FS, perform much better than the validity measures that do not use prototype geometry. However, using geometric information to measure partition compactness does not, by itself, yield a better validity measure; cluster compactness should be measured using the memberships. This is reasonable because the data are not always equally distributed, and cluster volumes may therefore differ from cluster to cluster; two data points with the same membership degrees need not lie at the same distance from the centers of the clusters to which they belong. The selection of a validity measure should therefore be specific to the individual application.

3.5.2 Local validity measures

Determining the optimal cluster number with global validity measures is very expensive, because the clustering must be carried out for multiple cluster numbers. If a perfect partition cannot be recognized in a single run, we can at least filter out the presumably correctly recognized clusters [121]. A validity measure that evaluates clusters individually, called a local validity measure, is therefore needed; it helps to correct the partition more directly. The idea behind local validity measures is to run FCM with a deliberately overestimated cluster number [122], analyze the resulting partition, and thereby provide an improved initialization for the next run. This procedure is repeated until the whole partition consists only of good clusters. Once a final partition with only good clusters is found, we have determined both the optimal number of clusters and the best assignment of data points to clusters. Hence, in addition to addressing the problem of cluster validation, local validity indices help with the problem of parameter initialization.
Definition 3.11 (Relative degree of connectivity of two clusters, $C_{REL}$): Let $(\mathbb{R}^p, \mathbb{R}^{cp})$ be an analysis space, $X \subset \mathbb{R}^p$ and $U: X \to V \in \mathbb{R}^{cp}$ a fuzzy partition. The relative degree of connectivity of the two clusters $v_k, v_l \in V$ is defined as [97, 127]

$$C_{REL}(v_k, v_l) = \sum_{i=1}^{n} \min\{u_{ki}, u_{li}\} \times e_i, \qquad (3.31)$$

where $e_i = -\sum_{t=1}^{c} u_{ti} \ln(u_{ti})$ is the entropy of the c-partition at $x_i \in X$; it attains its maximum value $\ln(c)$ when $x_i$ has the same membership degree in all clusters. Hence $0 \le C_{REL}(v_k, v_l) \le \frac{n \ln c}{2}$. The two clusters $v_k, v_l$ are well separated as $C_{REL} \to 0$.

Definition 3.12 (Inter-cluster connectivity, $C_{IC}$): Let $(\mathbb{R}^p, \mathbb{R}^{cp})$ be an analysis space, $X \subset \mathbb{R}^p$ and $U: X \to V \in \mathbb{R}^{cp}$ a fuzzy partition. The inter-cluster connectivity [129] of the two clusters $v_k, v_l \in V$ is defined as

$$C_{IC}(v_k, v_l) = |RF_{kl}| + |RF_{lk}|, \qquad (3.32)$$

where $RF_{kl} = \{ x_i \in X \mid u_{ki} = \max_{1 \le t \le c}\{u_{ti}\},\; u_{li} = \max_{1 \le t \le c,\, t \ne k}\{u_{ti}\} \}$. $C_{IC}(v_k, v_l)$ measures the overlap region between clusters $v_k$ and $v_l$. If the two clusters are well separated, then $C_{IC}(v_k, v_l) \to 0$; a greater $C_{IC}$ indicates a greater degree of similarity between $v_k$ and $v_l$. If $C_{IC}(v_k, v_l)$ is larger than a selected threshold, say 0.5, then the two clusters $v_k$ and $v_l$ need to be combined.

Definition 3.13 (Fuzzy inclusion similarity, $S_{FI}$): Let $(\mathbb{R}^p, \mathbb{R}^{cp})$ be an analysis space, $X \subset \mathbb{R}^p$ and $U: X \to V \in \mathbb{R}^{cp}$ a fuzzy partition. The fuzzy inclusion similarity [130] of the two clusters $v_k, v_l \in V$ is defined as

$$S_{FI}(v_k, v_l) = \frac{\sum_{i=1}^{n} \min\{u_{ki}, u_{li}\}}{\min\left\{\sum_{i=1}^{n} u_{ki},\; \sum_{i=1}^{n} u_{li}\right\}}. \qquad (3.33)$$

Definition 3.14 (Link-based similarity, $S_{LB}$): Let $(\mathbb{R}^p, \mathbb{R}^{cp})$ be an analysis space, $X \subset \mathbb{R}^p$ and $U: X \to V \in \mathbb{R}^{cp}$ a fuzzy partition.
The link-based similarity [131] of the two clusters $v_k, v_l \in V$ is defined as

$$S_{LB}(v_k, v_l) = \frac{\sum_{t=1,\, t \ne k,l}^{c} S^{t}_{indir}(v_k, v_l)}{S_{max}}, \qquad (3.34)$$

where $S^{t}_{indir}(v_k, v_l) = \min\{S_{Dir}(v_k, v_t), S_{Dir}(v_t, v_l)\}$, $S_{max} = \max_{1 \le k, l \le c}\{S_{LB}(v_k, v_l)\}$, and

$$S_{Dir}(v_k, v_l) = \frac{|v_k \cap v_l|}{|v_k \cup v_l|}.$$

Because $v_k$ and $v_l$ are fuzzy sets, according to fuzzy set theory we have

$$S_{Dir}(v_k, v_l) = \frac{\sum_{i=1}^{n} \min\{u_{ki}, u_{li}\}}{\sum_{i=1}^{n} \max\{u_{ki}, u_{li}\}}.$$

The cluster merging technique evaluates clusters for their compatibility (similarity) using any of the local validity measures discussed above and merges clusters found to be compatible. In every step of this merging procedure, the most similar pair of adjacent clusters is merged if their compatibility level is above a user-defined threshold γ, and the number of clusters is thereby gradually reduced. Either the compatible cluster merging (CCM) algorithm [126, 127, 128] or the agglomerative HC algorithm [132] can be used for this process.

For the Iris dataset, we first ran FCM with the number of clusters set to 12, which equals ⌊√n⌋ with n=150 and is clearly greater than the optimal number of clusters. The HC algorithm was applied with weighted linkage as the local validity measure to merge adjacent compatible clusters. In Figure 3-11, it can be seen that most of the clusters were merged into two large groups, while the third group contains only one cluster.

Figure 3-11: Dendrogram of the Iris dataset from a 12-partition generated by FCM

For the Wine dataset, we ran FCM with the number of clusters set to 13, which equals ⌊√n⌋ with n=178 and is larger than the optimal number of clusters. The HC algorithm was applied with weighted linkage. It can be seen in Figure 3-12 that the clusters gather into three groups.
Figure 3-12: Dendrogram of the Wine dataset from a 13-partition generated by FCM

By considering only the highest levels of the dendrograms in Figures 3-11 and 3-12, the number of clusters in the Iris and Wine datasets can be identified correctly. However, in fuzzy cluster analysis several prototypes can share one cluster, and a local validity measure based on memberships can only identify a prototype pair instead of all of these prototypes. For example, in the Iris dataset, clusters 2, 3, 7, 8 and 12 share the class Virginica, but only cluster 12 was separated to form a final cluster. For the Wine dataset, clusters 1, 2, 3, 4, 6, 9 and 10 share the overlap region between classes 2 and 3, but only clusters 9 and 10 are grouped to form a separate cluster. To solve this problem, the best cluster is removed after each merging step and the distances among the rest are recomputed [126-132]. This procedure is repeated until no further good clusters are found or the compatibility level falls below the threshold value γ [128].

3.6 FCM partition matrix initialization

Theorem 3.4 shows that the FCM algorithm is very sensitive to the initial value, U⁰, of the partition matrix. The determination of U⁰, i.e., the initialization of the partition matrix, determines whether the algorithm converges quickly to the global optimum or gets stuck in local optima. A common approach in most partitioning clustering methods is to generate the partition matrix randomly. However, this requires different mechanisms for different datasets because of differing data distributions.

Figure 3-13: ASET1 dataset with 6 clusters

We used the artificial dataset 1 (ASET1) as an example.
This dataset contains 171 data points in six clusters (Figure 3-13), generated using a finite mixture model as described in Section 2.5.1. We ran the three standard clustering algorithms, k-means, k-medians and FCM, with the number of clusters set to 6 and the partition matrix generated using a non-uniform random method. The cluster partition found by each algorithm was compared with the classification structure of the dataset. We repeated the experiment 20 times and averaged the performance of each algorithm, as shown in Table 3.4. Because of the random values in U⁰, these algorithms did not always converge to the optimal partition; hence the correctness ratios for individual classes and for the whole dataset are sometimes off.

Table 3.4: Performance of three standard algorithms on ASET1

Algorithm   Correctness ratio by class                 Average
            1     2     3     4     5     6
k-means     0.97  0.87  1.00  1.00  1.00  0.75         0.93
k-medians   0.95  0.82  1.00  1.00  1.00  0.62         0.90
FCM         0.97  1.00  0.95  1.00  1.00  0.96         0.98

An alternative approach is to search for the best locations of the cluster candidates using data points in the dataset. An integration of FCM with optimization algorithms such as GA and PSO is preferred. To manage a solution set, in which each solution is a set of data points selected as cluster candidates, the optimization algorithms try different values of the partition matrix. In this way, they solve the problem similarly to the random-initialization method mentioned previously, which may be appropriate for some datasets but not for others.

A simple yet effective method for the problem of initializing the partition matrix is subtractive clustering (SC). The subtractive algorithm, originally developed by Chiu [63], is a notable improvement to FCM, in which an improved mountain method was proposed to determine the number of clusters and estimate their initial values.
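The mountain-style subtractive scheme can be sketched as follows. The parameter names (ra, rb, stop_ratio) and the simplified single-threshold stopping rule are our assumptions for illustration, not the exact formulation used in the thesis:

```python
import math

def subtractive_clustering(X, ra=1.0, rb=1.5, stop_ratio=0.15):
    """Sketch of subtractive clustering: each point's 'mountain' potential
    counts nearby points (influence radius ra); the highest peak becomes a
    cluster center and its influence (radius rb > ra) is subtracted, until
    the remaining peak drops below stop_ratio of the first peak."""
    alpha, beta = 4.0 / ra ** 2, 4.0 / rb ** 2
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    P = [sum(math.exp(-alpha * d2(x, y)) for y in X) for x in X]
    centers, first_peak = [], max(P)
    while True:
        i = max(range(len(X)), key=lambda j: P[j])
        peak = P[i]
        if peak < stop_ratio * first_peak:
            break
        centers.append(X[i])
        # Subtract the selected peak's influence from every potential.
        P = [p - peak * math.exp(-beta * d2(x, X[i])) for p, x in zip(P, X)]
    return centers

# Two tight blobs -> two centers, one per blob.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(len(subtractive_clustering(X)))  # 2
```

Because the number of centers falls out of the potential-subtraction loop, SC both estimates the cluster number and provides the cluster candidates from which an FCM partition matrix can be initialized.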
Instead of generating the cluster candidates randomly, the SC method searches for the densest objects in the dataset. Li et al. [64], Liu et al. [65], and Yang et al. [67] used SC to locate the densest data points as cluster candidates; the partition matrix is then computed using the selected data points as cluster centers. However, the SC method requires the mountain peak and mountain radii parameters to be specified a priori, and their proper values may differ among datasets. Yang et al. [66] addressed this problem with a method that automatically computes these values from the dataset. However, the data distribution may not be identical among clusters, and the estimated parameters may suit some clusters but not others. Cheng et al. [68] proposed a weighted SC method that makes cluster candidate selection more flexible by allowing the cluster centers to be located at the means of the dense data points instead of selecting the densest data points themselves. This approach can help FCM work effectively, but it still requires all the SC parameters to be specified a priori.

3.7 Determination of the number of clusters

A disadvantage of FCM is that the number of clusters must be known in advance. When it is not, results with different numbers of clusters have to be compared using a validity function in order to find an optimal fuzzy partition with respect to the new objective function. There are two common methods to address the problem of the number of clusters using this approach [121].

3.7.1 Between-cluster-partition comparison

This method defines a global validity measure that evaluates a complete cluster partition. An upper bound c_max on the number of clusters is estimated, and a cluster analysis using FCM is carried out for each number of clusters c in {2,...,c_max}.
For each partition solution, the validity measure generates a validity index, which is then compared with the indices of the other partition solutions to find the optimal number of clusters.

Figure 3-14: PC index on the ASET2 dataset (maximize)

For the ASET2 dataset we used V_PC as the global validity measure and set c_max to 14, which is ⌊√n⌋ with n=200, the size of ASET2. The FCM algorithm was run with the number of clusters set from 2 to 14; each partition, corresponding to a certain number of clusters, was evaluated using V_PC. The results are shown in Figure 3-14. To find the optimal number of clusters, we look for the partition with the maximum value of V_PC (Definition 3.3). Five is therefore the optimal number of clusters, which equals the number of classes in ASET2. Hence, in the case of ASET2, V_PC correctly detected the number of clusters.

3.7.2 Within-cluster-partition comparison

This method defines a local validity measure that evaluates the individual clusters of a partition. An upper bound c_max on the number of clusters is also estimated, and a cluster analysis is carried out for c=c_max. The resulting clusters are then compared using the validity measure; similar clusters are merged into one, and very bad clusters are eliminated. These operations reduce the number of clusters in the partition. Afterwards, a cluster analysis is carried out on the remaining clusters. This procedure is repeated until the result no longer contains either similar clusters or very bad clusters.

Both 3.7.1 and 3.7.2 are based on the validity measures discussed in 3.5. Hence, the problems with cluster validity measures may cause these methods to remain in local optima when searching for the "best" number of clusters; they assume that the cluster analysis results are optimal for the given numbers of clusters. However, FCM does not guarantee this [132].
Hence, an optimal cluster partition found by these methods may be only a local optimum. In general [121], to evaluate a possibilistic cluster analysis we must consider both the quality of the clusters and the proportion of the classified data. This is especially difficult with quality measures for individual clusters, as in the method of 3.7.2 [121]; in this case, the method of 3.7.1 is therefore recommended.

3.8 Determination of the fuzzifier factor

The performance of FCM is affected by an exponent parameter m, called the fuzzifier factor. This parameter affects both the convergence rate and the cluster validity of the algorithm. For m=1, FCM behaves like k-means; that is, all clusters are disjoint. As m→∞, we have

$$\lim_{m \to \infty} u_{ki} = \lim_{m \to \infty} \left[ \sum_{l=1}^{c} \left( \frac{d^2(x_i, v_k)}{d^2(x_i, v_l)} \right)^{\frac{1}{m-1}} \right]^{-1} = \frac{1}{c} \quad \text{for } 1 \le k \le c,\; 1 \le i \le n,$$

$$\lim_{m \to \infty} v_k = \lim_{m \to \infty} \frac{\sum_{i=1}^{n} u_{ki}^m x_i}{\sum_{i=1}^{n} u_{ki}^m} = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x} \quad \text{for } 1 \le k \le c,$$

where $\bar{x}$ is the grand mean of X. Thus, when m goes to infinity, all the data points are considered to belong to a single cluster. Wu [144] ran FCM on the Iris dataset with different values of m (m=1.1, 1.5, 2, 3, 4, ..., 50) and the number of clusters set to 3, and measured the misclassification rate of each cluster partition, as shown in Figure 3-15. As m becomes large, the FCM algorithm becomes stable and no improvement can be achieved, even with additional iterations. Moreover, for U ∈ M_fcn, the constraint in (3.6) makes it difficult to interpret u_ki as representative of the k-th cluster at x_i. This is a very important problem with FCM, and any application of FCM should take the determination of m into account.

Figure 3-15 (Wu [144]): Impact of m on the misclassification rate in the Iris dataset

To date, there are no reliable criteria for the selection of the optimal fuzzifier for a given set of training vectors.
Instead, the best value, within a given range of m, is fixed through experimentation; i.e., the choice is still heuristic. To our knowledge, only a few methods exist to determine an optimal value of m. Romdhane et al. [57] proposed a method to estimate the value range that should contain the optimal values of m by defining limits on the overlap ratio between clusters; an iterative process is then carried out until the optimal value is reached. This approach provides an easy yet powerful way to estimate the value of m, but it has the drawback that its solution depends strongly on the initial random value of m. Dembele et al. [60] developed a method to estimate the value of m using the number of clusters and the estimated membership. The key idea is that, as m approaches its upper bound, the memberships of all data points converge to 1/c, where c is the number of clusters; they estimate the upper bound of m using this observation. This method avoids the local-optima problem caused by an initially random value of m, but it requires the number of clusters to be specified a priori. To address this issue, Schwammle et al. [61] proposed a method using cluster validity indices to estimate the value of c before determining the optimal value of m. Similarly, Zhao et al. [62] proposed a model-based approach to estimating m using a predefined value of c that varies from 2 to C, where C is the maximum possible number of clusters. The estimation in these two approaches is based on the fuzzy partition generated using the predefined value of c. This value is not guaranteed to be optimal, because most cluster validity measures are not appropriate for real datasets [147]. The fuzzy partition generated by the FCM algorithm with the selected number of clusters is also not optimal, because FCM may remain in local optima.
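The convergence of the memberships to 1/c as m grows, which underlies the upper-bound estimate of Dembele et al., is easy to verify numerically from the standard FCM membership formula. The helper below (our naming) computes one point's memberships from its distances to the c centers:

```python
def fcm_membership(dists, m):
    """FCM membership of one data point given its distances to the c
    cluster centers: u_k = 1 / sum_l (d_k / d_l)^(2/(m-1)), for m > 1."""
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dl) ** p for dl in dists) for dk in dists]

d = [1.0, 2.0, 4.0]  # one point's distances to three centers
for m in (1.1, 2.0, 10.0, 100.0):
    # Memberships flatten toward 1/3 as the fuzzifier grows.
    print(m, [round(u, 3) for u in fcm_membership(d, m)])
```

Near m=1 the closest center receives almost all of the membership (the k-means limit), while for large m every membership approaches 1/c, matching the limit derived in Section 3.8.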
3.9 Defuzzification of fuzzy partitions

Defuzzification in fuzzy cluster analysis is the procedure of converting the fuzzy partition matrix U into a crisp partition, which is then used to determine the classification information of the data objects. The most popular approach is application of the maximum membership degree (MMD) principle [148, 152, 153]: the data object $x_i$ is assigned to the class of $v_k$ if and only if its membership degree in cluster $v_k$ is the largest, that is,

$$u_{ki} = \max_{l=1,\dots,c}\{u_{li}\}. \qquad (3.35)$$

This approach may be inappropriate in some applications, because FCM membership is computed using the distance between the data object and the cluster center. Using the membership degree alone can assign marginal objects of a large cluster to an immediately adjacent small cluster. This is illustrated in Figure 3-16, where a data object in the gray rectangle may be incorrectly assigned to cluster 3 instead of cluster 2.

Figure 3-16: Artificial dataset 5 (ASET5) with three clusters of different sizes

Genther, Runkler and Glesner [151] proposed a defuzzification method based on the fuzzy cluster partition (FCB), where the partition information is applied in the membership degree computation. For each data object $x_i$, i=1,...,n, an estimated value $\mu(x_i)$ is computed using the fuzzy partition. The set of estimated values $\{\mu(x_i)\}_{i=1,\dots,n}$ is then used to compute the estimated value $w_k$ of $v_k$, k=1,...,c, as

$$w_k = \frac{\sum_{i=1}^{n} u_{ki}\, \mu(x_i)}{\sum_{i=1}^{n} u_{ki}}. \qquad (3.36)$$

The distance $d_F$ between a data object and a cluster center in FCB is computed using both the actual and the estimated information as

$$d_F(x_i, v_k) = \alpha \frac{d^2(x_i, v_k)}{\max_{j,l} d^2(x_j, v_l)} + \beta \frac{d^2(\mu(x_i), w_k)}{\max_{j,l} d^2(\mu(x_j), w_l)}, \qquad (3.37)$$

where $d_F()$ is used, instead of $d()$, to compute the membership degree.
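For reference, the baseline MMD rule of (3.35) is a one-line reduction over the membership rows. A minimal sketch (our naming; each row of U holds one point's memberships):

```python
def mmd_defuzzify(U):
    """Maximum-membership-degree rule (3.35): assign each data point to
    the cluster where its membership is largest (ties: lowest index)."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in U]

U = [[0.80, 0.15, 0.05],   # clearly cluster 0
     [0.30, 0.45, 0.25],   # cluster 1
     [0.10, 0.20, 0.70]]   # cluster 2
print(mmd_defuzzify(U))  # [0, 1, 2]
```

The methods that follow (FCB, NBM, sFCM) all keep this final arg-max step but first revise the memberships or the underlying distances, which is where the extra parameters enter.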
The two parameters α and β weight the distances from X and from its estimate to the cluster prototypes. Chuang et al. [150], in the Neighborhood-Based Method (NBM), instead proposed adjusting the membership status of every data point using the membership status of its neighbors. Given a cluster $v_k$, k=1,...,c, a data point $x_i$, i=1,...,n, receives an additional amount of membership degree from its neighbors:

$$h_{ki} = \sum_{x_j \in NB(x_i)} u_{kj}. \qquad (3.38)$$

This additional information is used in computing the membership degree, as in (3.39), so that the membership level of each data point is strongly associated with those of its neighbors, which helps with the determination of classification information:

$$u_{ki}^{NB} = \frac{(u_{ki})^p (h_{ki})^q}{\sum_{l=1}^{c} (u_{li})^p (h_{li})^q}. \qquad (3.39)$$

Similar to FCB, the NBM method requires a definition of the neighborhood boundaries and values for p and q, which weight the position and spatial information. Chiu et al. [149] also used spatial information to compute the distance between a data point and a cluster center in their sFCM (spatial FCM) method, as in (3.40):

$$d_S(x_i, v_k) = \frac{1}{NB} \sum_{x_j \in NB(x_i)} \left[ (1 - \alpha_j)\, d(x_i, v_k) + \alpha_j\, d(x_j, v_k) \right], \qquad (3.40)$$

where NB is the number of neighbors of each data point and $\alpha_j$ is the weighting factor of $x_j$, a neighbor of $x_i$, defined as

$$\alpha_j = \frac{1}{1 + \exp(\Delta_{ij} / \sigma_i)}, \qquad (3.41)$$

where $\Delta_{ij} = d(x_j, x_i)$. sFCM uses $d_S()$, instead of $d()$, to compute the membership degree, and updates the cluster centers $v_k$, k=1,...,c, as

$$v_k = \frac{\sum_{i=1}^{n} u_{ki}^m \frac{1}{NB} \sum_{x_j \in NB(x_i)} \left[ (1 - \alpha_j)\, x_i + \alpha_j\, x_j \right]}{\sum_{i=1}^{n} u_{ki}^m}. \qquad (3.42)$$

Algorithm performance measures

The overall performance of defuzzification methods can be measured by the number of misclassified data objects, defined as in (3.21): the cluster label of each data object is compared with its actual class label, and any mismatch counts as a misclassification.
The compactness measure, defined as in (3.43), can be used to assess how well the actual classification structure is retained:

$$\text{compactness} = \sum_{k=1}^{c} \sum_{i=1}^{N_k} d(x_i, v_k). \qquad (3.43)$$

We tested the four defuzzification algorithms, MMD, FCB, NBM and sFCM, on the two datasets ASET4 and ASET5 with various values of the fuzzifier factor m. Each algorithm was run 100 times and the results were averaged; the number of neighbors was set to 8. For the ASET4 dataset, which contains 300 data objects in 11 well-separated clusters in a 5-dimensional data space, the results are shown in Table 3.5.

Table 3.5: Algorithm performance on ASET4 using the MISC measure

m:     1.25   1.375  1.5    1.625  1.75   1.875  2
MMD    0.00   0.00   0.00   0.00   0.00   0.00   0.00
FCB    12.24  12.24  12.24  12.24  11.72  11.71  11.97
NBM    5.89   5.56   1.97   1.97   1.94   2.03   1.80
sFCM   12.23  9.60   0.30   0.75   0.84   0.84   0.84

In Table 3.5, only the MMD method correctly detected the classification information across the various values of m. The other methods performed worse, particularly FCB and sFCM, the methods that use spatial information to revise the distance between data object and cluster center. This may be because the assumption of 8 as the maximum number of neighbors for each data point, while correct in a spatial space, is not appropriate for a generic data space. We ran the same benchmark on ASET5, which contains 150 data points in 3 clusters of different sizes (see Figure 3-16). The results are shown in Table 3.6.

Table 3.6: Algorithm performance on ASET5 using the MISC measure

m:     1.25   1.375  1.5    1.625  1.75   1.875  2
MMD    16.00  16.00  16.00  16.00  16.00  16.00  16.00
FCB    4.32   4.32   4.59   4.59   4.32   4.32   4.32
NBM    27.19  27.19  27.19  27.19  27.19  27.19  27.19
sFCM   17.00  17.00  17.00  17.00  16.00  16.00  16.00

The FCB algorithm performed better than the other methods; here the methods that use spatial information in computing the distance between data object and cluster center, e.g., FCB and sFCM, provided better results.
However, these algorithms failed to produce correct classification information for this dataset. The test results show that it is not appropriate to apply spatial information to the defuzzification problem in cluster analysis of generic data: properly determining the number of neighbors for each data point is complicated, and choosing the two parameters α and β in these methods may not be easy because the data distribution differs among datasets.

4. Methods

4.1 Microarray meta-analysis

4.1.1 Selection of samples and features

Key challenges in meta-analysis methods arise from the small sample sizes typical of microarray studies and from variations due to differences in experimental protocols. Both affect the results of individual studies and inevitably degrade the final results of the meta-analysis. In addition, the main outcome of these approaches is a ranking of genes by significance; however, it is still unknown which feature (gene expression value, statistical-test p-value, likelihood measurement, or some other statistical measure) best represents the outcome. This chapter addresses the problem of choosing samples from multiple studies with the goal of reducing the variance and discovering more biological meaning. In addition, because microarray data are specific to the biological context of the experiment, the proposed meta-analysis solution reflects the appropriate biological context.

4.1.2 Sample and feature selection method

Common approaches for sample and feature selection are PCA, hierarchical clustering analysis, optimization algorithms and ranking methods. The first three approaches are limited because their results are strongly affected by the lab in which the experiments were carried out [25, 49]. These approaches are also computationally complex, because they use pairwise sample distance measurements.
When the number of probes or features is large, this process becomes time consuming, particularly with the use of external metrics. Lastly, these methods require the data to be obtained using the same microarray platform. The alternative approach is the ranking method. Because ranking itself is invariant to within-array monotonic data transformations, applying the rank-based TSPG (Top-Scoring Pair of Groups) method, which performs on-chip comparisons within each microarray, requires no data normalization or transformation before integration [169].

Feature ID mapping and selection

For microarray data obtained using the same generation of the same platform, we directly merge the individual datasets using probe sets common to all experiments. To merge data from different generations of the same microarray platform, we first use the Affymetrix probeset ID mapping to annotate all probesets. Probes that are not annotated are processed using EntrezGene with UniGene data. Probes with two or more gene IDs have their data replicated, once for each unique gene. To merge data from different platforms, we first map all probe identifiers to gene identifiers; probes with two or more gene IDs are preprocessed as above. This method avoids excluding probes from further analysis. In this study, we focus on integrating microarray data from the CMAP datasets, which were generated using the same platform but different generations. Specifically, we integrate microarray data generated from the Affymetrix HG-U133x and HG-U95x platforms to obtain datasets with large (>500) sample sizes. This integrated dataset is used to discover common gene expression signatures in response to drug treatments [170].

Sample selection

We developed the RPPR algorithm, a combination of the rank product (RP) statistical method with a modified version of the PageRank™ (PR) algorithm and the Kolmogorov-Smirnov (KS) goodness-of-fit statistical test.
The RP method generates the gene rank for gene expression profiles. The PR algorithm is then applied to augment the gene ranks using protein-protein interaction and GO Biological Process annotation data. The KS statistical test is used to filter the expression profiles whose gene ranks best describe a given gene expression signature.

Given a sample expression profile X and a gene expression signature S, denote by $S_u$ and $S_d$ the lists of up-regulated and down-regulated genes, respectively. The connectivity score between X and S is computed using the Kolmogorov-Smirnov (KS) goodness-of-fit scores of the two sub-lists in S against the profile X. The KS score of a tag list s against X is calculated as follows. Sort X based on the appropriate expression status of the genes. Denote by t and n the lengths of s and X, respectively. Construct a vector O of the positions (1,...,n) of each probe set of s in X and sort its components in ascending order, so that O(j) is the position of tag j, where j = 1, 2, ..., t. Compute

$$a = \max_{1 \le i \le t} \left[ \frac{i}{t} - \frac{O(i)}{n} \right], \qquad b = \max_{1 \le i \le t} \left[ \frac{O(i)}{n} - \frac{i-1}{t} \right], \qquad (4.1)$$

and then

$$KS(s, X) = \begin{cases} a, & a > b \\ -b, & a \le b \end{cases}. \qquad (4.2)$$

For each expression profile $X^p$ associated with a specific drug, p=1,...,P, where P is the number of profiles, let $KS^p_{up} = KS(S_u, X^p)$ and $KS^p_{down} = KS(S_d, X^p)$. The KS connectivity score of the profile $X^p$ to the gene expression signature S is defined as

$$s^p = KS^p_{up} - KS^p_{down}. \qquad (4.3)$$

Let $s_{max} = \max_{p=1,\dots,P}(s^p)$ and $s_{min} = \min_{p=1,\dots,P}(s^p)$. The normalized connectivity score of the profile $X^p$ to the gene expression signature S is defined as

$$S^p = \begin{cases} 0, & \operatorname{sign}(KS^p_{up}) = \operatorname{sign}(KS^p_{down}) \\ s^p / s_{max}, & s^p > 0 \\ -s^p / s_{min}, & s^p < 0 \end{cases}. \qquad (4.4)$$

Because RPPR uses only the preprocessed expression data, it is also appropriate for datasets generated using next-generation sequencing (RNA-seq) and other high-throughput technologies.
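The scoring in (4.1)-(4.3) can be sketched directly. The function names are ours, and the positions are assumed to be 1-based ranks already sorted in ascending order, as in the construction of O above:

```python
def ks_score(positions, n):
    """KS goodness-of-fit of a tag list to a ranked profile (4.1)-(4.2).
    positions: the 1-based ranks O(1..t) of the tags in the profile,
    sorted ascending; n: the length of the profile X."""
    t = len(positions)
    a = max(i / t - positions[i - 1] / n for i in range(1, t + 1))
    b = max(positions[i - 1] / n - (i - 1) / t for i in range(1, t + 1))
    return a if a > b else -b

def connectivity(up_positions, down_positions, n):
    """Connectivity score s = KS(S_u, X) - KS(S_d, X) (4.3)."""
    return ks_score(up_positions, n) - ks_score(down_positions, n)

# Up-tags at the top and down-tags at the bottom of a 100-gene profile
# give a strongly positive connectivity score.
print(round(connectivity([1, 2, 3], [98, 99, 100], 100), 2))  # 1.95
```

A profile whose up-regulated tags sit high and down-regulated tags sit low in the ranking scores near +2 before normalization; the opposite arrangement scores near -2, which is what (4.4) then rescales into [-1, 1].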
Method testing

We tested RPPR using the CMAP datasets to detect expression profiles strongly related to the expression signatures, S1 and S2, of two drug classes: the histone deacetylase (HDAC) inhibitors (S1) and estrogen (S2). We compared the performance of RPPR with those of Lamb et al. [25] and Zhang et al. [26].

HDAC inhibitor signature

CMAP arrays were preprocessed using RMA, resulting in 564 expression profiles. The RP [29] method was then applied to every study to detect differentially expressed genes and to order the genes in each profile. The pattern-matching ranking method was used to score all the profiles with respect to the HDAC inhibitor signature [25]. We used a cutoff of 0.95 on the normalized connectivity scores to select profiles strongly related to the signature. Results, compared with those of Lamb et al. [25] and Zhang et al. [26], are shown in Tables 4.1 and 4.2 and in Figures 4-1 and 4-2, respectively.

Table 4.1: Predicted HDAC antagonists (1: detected; 0: not detected; columns: Lamb, Zhang, RPPR)
valproic acid 1
trichostatin A 1
resveratrol
HC toxin 1
vorinostat 1
wortmannin 1
5666823
prochlorperazine 1
17-allylamino-geldanamycin

Table 4.2: Predicted HDAC agonists (1: detected; 0: not detected; columns: Lamb, Zhang, RPPR)
estradiol 1 1
genistein 1 1
nordihydroguaiaretic acid 1 1
thioridazine
LY-294002
staurosporine
indometacin 1 1
alpha-estradiol 1 1
tretinoin
doxycycline 1
troglitazone 1
5149715 1
12,13-EODE 1 1
17-allylamino-geldanamycin
butein
wortmannin
tetraethylenepentamine
diclofenac
4,5-dianilinophthalimide
pirinixic acid
haloperidol

Figure 4-1: Venn diagram for predicted HDAC antagonists

Figure 4-2: Venn diagram for predicted HDAC agonists

Figure 4-1 illustrates the results of the three approaches using the HDAC inhibitor signature (S1). The results of RPPR are similar to those of Lamb et al.
[25], because both used the same scoring method. Vorinostat, trichostatin A and valproic acid, well-known HDAC inhibitors, were all identified by RPPR. HC toxin was not identified by Lamb et al. [25] or Zhang et al. [26], but was identified by RPPR. Although 17-allylamino-geldanamycin, wortmannin and prochlorperazine were not reported in the top 20 profiles from RPPR, they were ranked with high scores: 0.639, 0.557 and 0.657, respectively. It is interesting that only RPPR detected resveratrol, which has recently been recognized as an HDAC inhibitor like valproic acid [119].

Estrogen signature

Table 4.3: Predicted estrogen receptor antagonists (1: detected; 0: not detected; columns: Lamb, Zhang, RPPR)
tamoxifen
fulvestrant 1
trichostatin A
raloxifene 1
monastrol 1
5224221 1
oxaprozin
HC toxin
vorinostat 1
LY-294002 1
demecolcine
5248896
Y-27632 1
sirolimus 1 1
17-allylamino-geldanamycin 1
monorden 1
sodium phenylbutyrate 1
deferoxamine 1
indomethacin 1
tretinoin 1

Figure 4-3: Venn diagram for predicted estrogen receptor antagonists

Figure 4-4: Venn diagram for predicted estrogen receptor agonists

For the estrogen signature (S2), Figures 4-3 and 4-4 show that the results of RPPR are similar to those of Lamb. It is interesting that both RPPR and Zhang's method identified estradiol with the highest rank, but only RPPR detected tamoxifen as negatively related to the estrogen signature. In fact, tamoxifen is a compound that works by blocking the estrogen receptors on breast tissue cells and slowing their estrogen-induced growth; it has recently been chosen as one of the best replacements for fulvestrant in breast cancer treatment. The results of RPPR agreed significantly with those of Lamb et al. (2006) and Zhang et al. (2008), and some RPPR results that were new relative to theirs have also been supported by recent literature.
Table 4.4: Predicted estrogen receptor agonists (1: detected; 0: not detected; columns: Lamb, Zhang, RPPR), with known target/mechanism
valproic acid (epilepsy treatment; ER stimulator) 1 1
trichostatin A (HDAC inhibitor; ER sensitizer) 1 1
HC toxin (HDAC inhibitor; related to the ER signaling pathway) 1 1
vorinostat (skin cancer treatment; related to advanced-stage breast cancer) 1 1
wortmannin (long-term estrogen deprivation agent, usually used with tamoxifen for better ER inhibition)
5666823 (similar to 2-deoxy-D-glucose (0.7765), like tamoxifen (0.7143)) 1
prochlorperazine (nausea and vomiting treatment; endocrine: not used with hormone-replacement estrogens) 1
17-allylamino-geldanamycin (Hsp90 inhibitor; Hsp90 is required for the proper functioning of ER) 1

Although RPPR used the same goodness-of-fit scoring method as Lamb et al. [25], it is based on different distribution functions, generated by the different methods used to detect and rank differentially expressed genes. In addition, instead of using the same distribution function for both up-regulated and down-regulated genes, RPPR used the up-regulation distribution of the drug-specific profile for the up-regulated genes and the down-regulation distribution for the down-regulated genes. This helps to address a weakness of the KS goodness-of-fit test, namely its reduced sensitivity at the tails of the distribution. Despite restricting the range of the distribution function, the results produced by RPPR are very close to those of Lamb et al. [25] and Zhang et al. [26], although no change was made to the step-function part of the distribution function. This restriction also speeds up the scoring process, since the distribution function can be approximated well by managing the step function.
The experimental results in this study also show that, although CMAP datasets are generated from unreplicated single arrays, the use of meta-analysis with the rank product and PageRank methods, together with biological knowledge, reduces the variance of the selected samples. The results therefore agree strongly with the biological context of interest, e.g., the HDAC inhibitors and the estrogen receptor.

4.1.3 Methods for the feature metric

One of the most important issues in meta-analysis is how to define a similarity metric across samples or studies so that results have strong biological relevance. Recent research has shown that, because of differences in experimental designs and laboratory-specific effects, gene expression data integrated from multiple studies may correlate poorly under the biological context of interest. To overcome this problem, we propose:

• To use protein-protein interactions (PPI) with the Random Walk algorithm to determine gene-gene proximities.

• To use GO cellular component (CC) and biological process (BP) annotations with term-term semantic similarity to determine gene-gene similarities.

4.2 Fuzzy cluster analysis methods using Fuzzy C-Means (FCM)

The FCM algorithm has several limitations: the initialization of the partition matrix can cause the algorithm to become trapped in local optima; partially missing data may cause the algorithm to work improperly; there is no strong mathematical/statistical and biological basis for validating the fuzzy partition; FCM cannot determine the number of clusters in the dataset; and it lacks an effective method for defuzzification, since the maximum membership degree (MMD) principle has been shown to be inappropriate for gene expression data. To address these problems, we first modify the FCM objective function using the partition coefficient, because of its ability to maintain partition compactness (Section 3.5.1).
For initialization of the partition matrix, we developed a new Subtractive Clustering (SC) method that uses a fuzzy partition of the data instead of the data themselves [146]. To address the problem of partially missing data, we developed two new imputation methods using density-based and probability-based approaches [154, 155]. Regarding the cluster validation problem, we developed a new evaluation method for fuzzy clustering, particularly for gene expression data [147]. The evaluation model considers both the internal validity, which is statistical in nature, and the external validity, which is biological. For the problem of the cluster number, we developed a new clustering algorithm that integrates our algorithms with a genetic algorithm (GA) to manage a set of candidate solutions [157]. Using our fuzzy cluster evaluation method, the optimization algorithm can search for the best solutions during its evolutionary process. To produce classification information from a fuzzy partition, we developed a novel defuzzification method that generates a probabilistic model from the possibilistic model described by the fuzzy partition and applies this model to produce the classification information for the dataset [156].

4.2.1 Modification of the objective function

Both the partition coefficient (PC) and the partition entropy (PE) are among the best measures of the compactness of a fuzzy partition. Because PC performs better than PE in this respect (Section 3.5.1), PC was used as the compactness factor in the FCM objective function of our method:

J(U,V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{m} d^{2}(x_{i}, v_{k}) + \mu \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{m} \to \min,   (4.5)

where \mu, the learning rate, is defined as in (4.78), Section 4.3.1. We consider minimizing (4.5) with respect to (3.6) using the method of Lagrange multipliers. Let the Lagrange multipliers be \lambda_{i}, i=1,...,n, and put

L = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{m} d^{2}(x_{i}, v_{k}) + \mu \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{m} - \sum_{i=1}^{n} \lambda_{i} \left( \sum_{k=1}^{c} u_{ki} - 1 \right).

For the necessary optimality condition,

\frac{\partial L}{\partial u_{ki}} = m u_{ki}^{m-1} \left[ d^{2}(x_{i}, v_{k}) + \mu \right] - \lambda_{i} = 0,   (4.6)

and, with the same reasoning as in (3.9) and (3.10),

\frac{\partial L}{\partial v_{k}} = -2 \sum_{i=1}^{n} u_{ki}^{m} (x_{i} - v_{k}) = 0.   (4.7)

From (4.6),

u_{ki} = \left( \frac{\lambda_{i}}{m (d^{2}(x_{i}, v_{k}) + \mu)} \right)^{\frac{1}{m-1}} = \left( \frac{\lambda_{i}}{m} \right)^{\frac{1}{m-1}} \left( \frac{1}{d^{2}(x_{i}, v_{k}) + \mu} \right)^{\frac{1}{m-1}}.   (4.8)

Summing over k=1,...,c and taking (3.6) into account, we have

1 = \sum_{k=1}^{c} u_{ki} = \left( \frac{\lambda_{i}}{m} \right)^{\frac{1}{m-1}} \sum_{k=1}^{c} \left( \frac{1}{d^{2}(x_{i}, v_{k}) + \mu} \right)^{\frac{1}{m-1}}.

Hence

\left( \frac{\lambda_{i}}{m} \right)^{\frac{1}{m-1}} = \left[ \sum_{l=1}^{c} \left( \frac{1}{d^{2}(x_{i}, v_{l}) + \mu} \right)^{\frac{1}{m-1}} \right]^{-1}.

Together with (4.8), we obtain the following membership update for the FCM version of our method:

u_{ki} = \frac{ \left( \frac{1}{d^{2}(x_{i}, v_{k}) + \mu} \right)^{\frac{1}{m-1}} }{ \sum_{l=1}^{c} \left( \frac{1}{d^{2}(x_{i}, v_{l}) + \mu} \right)^{\frac{1}{m-1}} }.   (4.9)

Similarly, we derive from (4.7), for each k, 1 \le k \le c:

v_{k} = \frac{ \sum_{i=1}^{n} u_{ki}^{m} x_{i} }{ \sum_{i=1}^{n} u_{ki}^{m} }.   (4.10)

The update model of \Theta = {U,V} in our FCM method uses (4.9) and (4.10).

4.2.2 Partition matrix initialization method

FCM has recently been integrated with optimization algorithms such as the Genetic Algorithm, Particle Swarm Optimization, and Ant Colony Optimization [161-166, 176-177]. Alternatively, a Subtractive Clustering (SC) method has been used with FCM, where SC is used first to determine the optimal number of clusters and the locations of the cluster centers [105-108], and FCM is then used to determine the optimal fuzzy partition [105, 111-113]. While this approach can overcome the problem of parameter initialization, it still requires a priori specification of the parameters of the SC method: the mountain peak and the mountain radii. SC uses these two parameters to compute and amend the value of the mountain function at every data point while it searches for cluster candidates. Most approaches using the traditional SC method use constant values for these parameters [105, 109-111]. However, different datasets have different data distributions, and these values therefore need to be adjusted accordingly. Yang and Wu [106] proposed a method to do this automatically.
However, because the data densities are not always distributed equally within the dataset, the automatically determined values may be appropriate only for some data points. In general, use of the SC method to determine the optimal number of clusters is based on the ratio between the amended value and the original value of the mountain function at every data point. The data points at which the ratios are above some predefined cutoff are selected, and their number is used as the optimal number of clusters [105, 107, 110]. This approach, however, requires the specification of a cutoff value, which will differ among datasets.

Method design - histogram-based density

Given \Theta = {U,V}, a fuzzy partition on X, the accumulated density at v_{k}, k=1,...,c, is calculated [112] as

Acc(v_{k}) = \sum_{i=1}^{n} u_{ki}.   (4.11)

By assuming a uniform density of the data observed in each cell, an estimate f_{hist}(x) of the underlying probability density function f(x) at any point x of v_{k} can be computed by

f_{hist}(x) = \frac{Acc(v_{k})}{n s h},   (4.12)

where h is the bin width and s is the number of bins generated on v_{k}. The density at each data point x_{i} is estimated, using a histogram-based method, as

dens(x_{i}) = \sum_{k=1}^{c} Acc(v_{k}) \times u_{ki}.   (4.13)

However, the histogram density estimator in (4.12) has some weaknesses. It is a discontinuous function, and the choice of the values of both h and s strongly affects the estimated density. The a priori knowledge needed to set these values makes it a tool whose robustness and reliability are too low for statistical estimation. The density estimator in (4.13) is therefore more accurate at the centers {v_{k}} than at other data points [112]. If we try to find the c most dense data points using (4.13), we will simply obtain the centers of the c clusters of the fuzzy partition. To address this problem, we use the concept of a strong uniform fuzzy partition.

Definition 4.1 (strong uniform fuzzy partition) [112]: Let m_{1} < m_{2} < ... < m_{c} be c fixed nodes of the universe \Omega (or universal set), \Omega = [a,b], such that m_{1} = a and m_{c} = b, and c \ge 2. We say that the set of c fuzzy subsets A_{1}, A_{2}, ..., A_{c}, identified with their membership functions \mu_{A_{1}}, \mu_{A_{2}}, ..., \mu_{A_{c}} defined on \Omega, forms a strong uniform fuzzy partition of \Omega if the following conditions are fulfilled for 1 \le k \le c:

1) \mu_{A_{k}}(m_{k}) = 1 (the prototype).
2) If x \notin [m_{k-1}, m_{k+1}], then \mu_{A_{k}}(x) = 0.
3) \mu_{A_{k}}(x) is continuous.
4) \mu_{A_{k}}(x) monotonically increases on [m_{k-1}, m_{k}] and monotonically decreases on [m_{k}, m_{k+1}].
5) \forall x \in \Omega, \exists k such that \mu_{A_{k}}(x) > 0.
6) \forall x \in \Omega, \sum_{k=1}^{c} \mu_{A_{k}}(x) = 1 (strength condition).
7) m_{k+1} - m_{k} = h, 1 \le k < c.
8) \mu_{A_{k}}(m_{k} - x) = \mu_{A_{k}}(m_{k} + x), 1 \le k \le c.
9) \forall x \in [m_{k}, m_{k+1}], \mu_{A_{k}}(x) = \mu_{A_{k-1}}(x - h) (same shape in the overlap region).

Note that the 6th condition is the strength condition, i.e., without this condition the fuzzy partition is no longer strong.

Proposition 4.1 (extended fuzzy partition) [112]: Let A_{k}, k=1,...,c, be a strong uniform fuzzy partition on D, extended on D'. Then:

1) \forall x \in D, \exists! k_{0} \in {1,...,c-1} such that \forall k \notin {k_{0}, k_{0}+1}, \mu_{A_{k}}(x) = 0, and \mu_{A_{k_{0}}}(x) + \mu_{A_{k_{0}+1}}(x) = 1.
2) For k = 1,...,c, \int_{m_{k-1}}^{m_{k+1}} \mu_{A_{k}}(x) dx = h.
3) \exists K_{A}: [-1,1] \to [0,1] such that \mu_{A_{k}}(x) = K_{A}\left( \frac{x - m_{k}}{h} \right) I_{(m_{k-1}, m_{k+1})}(x) and \int_{-1}^{1} K_{A}(u) du = 1.

Definition 4.2: The fuzzy histogram density estimator defined on D is given by

f_{hist}(x) = \frac{1}{nh} \sum_{k=1}^{c} Acc(v_{k}) \times \mu_{B_{k}}(x),   (4.14)

where {B_{k}}, k=1,...,c, is a strong uniform fuzzy partition defined on D. It can easily be shown that f_{hist} \ge 0 and \int f_{hist}(x) dx = 1, and that f_{hist} goes through the c points (m_{k}, Acc(v_{k})/nh).

We now define a new fuzzy partition \Theta_{f} = {U^{f}, V}, based on Definition 4.1 and Proposition 4.1, using \Theta = {U,V}:

u_{ki}^{f} = \frac{ \left( \frac{\| x_{i} - v_{k} \|}{\sigma_{k}} \right)^{-1} }{ \sum_{l=1}^{c} \left( \frac{\| x_{i} - v_{l} \|}{\sigma_{l}} \right)^{-1} },   (4.15)

where

\sigma_{k} = \sum_{i=1}^{n} P(x_{i} | v_{k}) \| x_{i} - v_{k} \|.   (4.16)

Because {U^{f}, V} is a strong uniform fuzzy partition [112], using Definition 4.2 the density at every data point can be estimated as

dens(x_{i}) = \sum_{k=1}^{c} Acc(v_{k}) \times u_{ki}^{f}.   (4.17)

Fuzzy mountain function amendment

Each time the most dense data point is selected, the mountain function of the other data points must be amended in order to search for new cluster centers. In the traditional SC method, the amendment is done using the direct relationships between the selected data point and its neighborhood, i.e., using a pre-specified mountain radius. In our approach, this is done using the fuzzy partition, and no other parameters are required. First, the accumulated density of all cluster centers is revised using

Acc_{t+1}(v_{k}) = Acc_{t}(v_{k}) - M_{t}^{*} \times P(v_{k} | x_{t}^{*}),   (4.18)

where x_{t}^{*} is the data point selected as the new cluster center and M_{t}^{*} is the mountain function value at time t. The density at each data point is then re-estimated using (4.17) and (4.18) as follows:

dens_{t+1}(x_{i}) = \sum_{k=1}^{c} Acc_{t+1}(v_{k}) \times u_{ki}^{f}
  = \sum_{k=1}^{c} [Acc_{t}(v_{k}) - M_{t}^{*} \times P(v_{k} | x_{t}^{*})] \times u_{ki}^{f}
  = \sum_{k=1}^{c} Acc_{t}(v_{k}) \times u_{ki}^{f} - M_{t}^{*} \sum_{k=1}^{c} u_{ki}^{f} \times P(v_{k} | x_{t}^{*})
  = dens_{t}(x_{i}) - M_{t}^{*} \sum_{k=1}^{c} u_{ki}^{f} \times P(v_{k} | x_{t}^{*}).

Hence, the estimated density at each data point is amended as

dens_{t+1}(x_{i}) = dens_{t}(x_{i}) - M_{t}^{*} \sum_{k=1}^{c} u_{ki}^{f} \times P(v_{k} | x_{t}^{*}).   (4.19)

fzSC - fuzzy subtractive clustering algorithm [146]

Steps:
1) Generate a fuzzy partition A using FCM.
2) Create a strong uniform fuzzy partition B using (4.15).
3) Estimate the density at every x_{i} \in X using (4.17).
4) Select the most dense data point as a cluster candidate.
5) If the stop condition is met, then stop.
6) Re-estimate the density at every x_{i} using (4.19).
7) Go to Step 4.

To determine the optimal number of clusters in the dataset, we first locate the \lfloor \sqrt{n} \rfloor most dense data points. Only the data points at which the ratio between the mountain function values at time t and time 0 is above 0.95, which is the stop condition, are selected.
This number is considered the optimal number of clusters in the dataset.

Method testing

To evaluate the ability of fzSC to search for the best cluster centers, we ran fzSC on ASET1 (Figure 3-13) with six clusters. The result is shown in Figure 4-5. We then compared fzSC with three standard algorithms, k-means, k-medians and fuzzy c-means, on the ASET1 dataset (Figure 3-13). Each algorithm was run 20 times and the correctness ratio was averaged. Results are shown in Table 4.5.

Figure 4-5: Candidate cluster centers in the ASET1 dataset found using fzSC. Squares, cluster centers by FCM; dark circles, cluster centers found by fzSC. Classes are labeled 1 to 6.

Table 4.5: Algorithm performance on the ASET1 dataset

Algorithm   Correctness ratio by class            Avg. ratio
            1     2     3     4     5     6
fzSC        1.00  1.00  1.00  1.00  1.00  1.00    1.00
k-means     0.97  0.87  1.00  1.00  1.00  0.75    0.93
k-medians   0.95  0.82  1.00  1.00  1.00  0.62    0.90
FCM         0.97  1.00  0.95  1.00  1.00  0.96    0.98

With a correctness ratio of 100% for every class and for the whole dataset, fzSC always successfully detected the optimal partition of ASET1, while the other methods, which initialize the partition matrix randomly, did not.

To check the ability of fzSC to detect the number of clusters, we ran fzSC on each of the artificial datasets 20 times and averaged the performance.

Table 4.6: fzSC correctness in determining the cluster number on artificial datasets

Number of clusters    Dataset dimension
generated             2     3     4     5
5                     0.97  1.00  1.00  1.00
6                     1.00  0.98  0.90  1.00
7                     1.00  1.00  1.00  1.00
8                     1.00  0.99  0.97  1.00
9                     0.87  0.99  1.00  0.96

Regarding the problem of partition matrix initialization, fzSC provides better results than the three standard clustering algorithms, which use a random initialization mechanism.
On the artificial datasets, fzSC performs well overall. It performs worst on the datasets with nine clusters in a two-dimensional data space. Plotting some of these datasets shows that some clusters largely overlap and tend to merge into one; we therefore attribute the weaker performance of fzSC on these datasets to improperly generated data rather than to the algorithm itself. For the real datasets, Iris, Wine, Glass and Breast Cancer Wisconsin, the numbers of clusters are 3, 3, 6, and 6, respectively. The data in the Wine and Glass datasets were normalized by attribute before clustering. The Pearson correlation distance, which is the most appropriate for real datasets, was used to measure the similarity between data points. We ran fzSC 20 times on each of these datasets with the value of L, the number of solutions, set to 15. The number of clusters found was compared with the known number in each dataset. Results are shown in Table 4.7.

Table 4.7: fzSC performance on real datasets

Dataset                   #data points  #clusters  Predicted #clusters  Correctness ratio
Iris                      150           3          3                    1.00
Wine                      178           3          3                    1.00
Glass                     214           6          6 / 5                0.95 / 0.05
Breast Cancer Wisconsin   699           6          6 / 5                0.65 / 0.35

We have proposed a novel algorithm to address the problem of parameter initialization in the standard FCM algorithm [146]. Our method is novel because the subtractive clustering technique uses a fuzzy partition of the data instead of the data themselves. The advantages of fzSC are that, unlike traditional SC methods, it does not require specification of the mountain peak and mountain radii, and, with a running time of O(cn), compared to O(n^2) for the traditional SC method, it is more efficient for large datasets. In addition, fzSC can easily be integrated with fuzzy clustering algorithms to search for the best centers of cluster candidates, or to address other existing problems of FCM, such as the missing data and fuzzifier determination problems.
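The core of fzSC, namely the density estimation over a fuzzy partition in (4.11) and (4.17) and the amendment step (4.19), can be sketched as follows. This is a simplified sketch: it uses the FCM membership matrix U directly where the full method uses the strong uniform fuzzy partition of (4.15), and it approximates P(v_k | x_t) by the membership column of the selected point.

```python
import numpy as np

def fzsc_candidates(U, n_candidates, stop_ratio=0.95):
    """Sketch of fzSC candidate selection.

    U: c-by-n fuzzy partition matrix (rows: clusters, cols: data points).
    Returns the indices of the selected cluster-center candidates.
    """
    acc = U.sum(axis=1)              # accumulated density Acc(v_k), (4.11)
    dens = acc @ U                   # dens(x_i) = sum_k Acc(v_k) * u_ki, (4.17)
    m0 = None
    selected = []
    for _ in range(n_candidates):
        i = int(np.argmax(dens))     # most dense point -> next candidate
        m_t = dens[i]
        if m0 is None:
            m0 = m_t                 # mountain value at time 0
        elif m_t / m0 < stop_ratio:  # stop condition on the ratio
            break
        selected.append(i)
        # Amendment (4.19), with P(v_k | x_t) approximated by U[:, i]:
        # dens(x_j) -= M_t * sum_k u_kj * u_ki
        dens = dens - m_t * (U[:, i] @ U)
    return selected
```

With a near-crisp two-cluster partition, the sketch first picks the densest point of the larger cluster, amends the densities, and then either picks a point of the second cluster or stops, depending on the ratio threshold.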
We combined FCM with the new SC method to automatically determine the number of clusters in a dataset. FCM randomly creates a set of fuzzy partition solutions for the dataset using a maximum number of clusters; SC then uses the fuzzy partition approach to search for the best cluster centers and the optimal number of clusters for each solution. FCM then rapidly determines the best fuzzy partition of every solution using the optimal number of clusters determined by SC. The best solution from this solution set is the final result.

4.2.3 Fuzzy clustering evaluation method

Many cluster validity indices have been developed to evaluate the results of FCM. Bezdek [92] measured performance using partition entropy and the overlap of adjacent clusters. Fukuyama and Sugeno [93] combined the FCM objective function with a separation factor, while Xie and Beni [94] integrated the Bezdek index [92] with a cluster separation factor. Rezaee et al. [95] combined compactness and separation factors, and Pakhira et al. [96] combined the same two factors with the separation factor normalized. More recently, Rezaee [97] proposed a new cluster index in which the two factors are normalized across the range of possible numbers of clusters. These indices were introduced in Section 3.5.1.

Here, we propose a fuzzy cluster validation method that uses the fuzzy partition and the matrix of distances between cluster centers and data points. Instead of compactness and separation, fzBLE uses a Bayesian model and a log-likelihood estimator. By using both a possibility model and a probability model to represent the data distribution, fzBLE is appropriate for artificial datasets, where the distribution follows a standard model, as well as for real datasets, which lack a standard distribution. We show that fzBLE outperforms popular cluster indices on both artificial and biological datasets.
Method design - Bayesian and likelihood-based approach

Instead of using the two factors, compactness and separation, we propose a validation method (fzBLE) that is based on a log-likelihood estimator with a fuzzy-based Bayesian model. A fuzzy clustering solution is modeled by \Theta = {U, V}, where V represents the cluster centers and U is the partition matrix representing the membership degrees of the data points in the clusters. The likelihood of the clustering model given the data is measured as

L(\Theta | X) = L(U,V | X) = \prod_{i=1}^{n} P(x_{i} | U,V) = \prod_{i=1}^{n} \sum_{k=1}^{c} P(v_{k}) \times P(x_{i} | v_{k}),   (4.20)

and the log-likelihood is computed as

\log(L) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{c} P(v_{k}) \times P(x_{i} | v_{k}) \right) \to \max.   (4.21)

Because our clustering model is possibility-based, a transformation from possibilities to probabilities is needed before Equations (4.20) and (4.21) can be applied. Given a fuzzy clustering model \Theta = {U, V}, according to [98], u_{ki} is the possibility that v_{k} = x_{i}. If \Theta is a proper fuzzy partition, then there exists some x_{0} such that u_{k}(x_{0}) = 1, k=1,...,c, and u_{k} is a normal possibility distribution. Assume P_{k} is the probability distribution of v_{k} on X, with p_{k1} \ge p_{k2} \ge ... \ge p_{kn}. We associate with P_{k} a possibility distribution U_{k} on X [98] such that u_{ki} is the possibility of x_{i}, where

u_{kn} = n \times p_{kn},
u_{ki} = i (p_{ki} - p_{k,i+1}) + u_{k,i+1}, i = (n-1),...,1.   (4.22)

Reversing (4.22), we obtain the transformation from a possibility distribution to a probability distribution. Assume that U_{k} is ordered in the same way as P_{k} on X: u_{k1} \ge u_{k2} \ge ... \ge u_{kn}. Then

p_{kn} = u_{kn} / n,
p_{ki} = p_{k,i+1} + (u_{ki} - u_{k,i+1}) / i.   (4.23)

P_{k} is an approximate probability distribution of v_{k} on X, and p_{ki} = P(x_{i} | v_{k}). If U_{k} is a normal possibility distribution, then \sum_{i} p_{ki} = 1.

The data distributions

Having computed the values of P_{k}, we can estimate the variance \sigma_{k}^{2}, the prior probability P(v_{k}), and the normal distribution of v_{k} for each cluster k [27].
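The possibility-to-probability transformation (4.23) can be sketched directly. This is a minimal sketch assuming the membership vector is already sorted in descending order; the function name is illustrative.

```python
def possibility_to_probability(u):
    """Convert a possibility distribution to a probability distribution,
    reversing (4.22) as in (4.23).

    u: membership degrees with u[0] >= u[1] >= ... >= u[n-1].
    The result sums to 1 when u is a normal possibility distribution
    (i.e., max(u) == 1).
    """
    n = len(u)
    p = [0.0] * n
    p[n - 1] = u[n - 1] / n                 # p_kn = u_kn / n
    for i in range(n - 2, -1, -1):          # i = n-1, ..., 1 (1-based index)
        p[i] = p[i + 1] + (u[i] - u[i + 1]) / (i + 1)
    return p
```

For example, the normal possibility distribution [1, 0.5, 0.5, 0] maps to the probabilities [2/3, 1/6, 1/6, 0], and applying (4.22) to that result recovers the original memberships.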
\sigma_{k}^{2} = \frac{ \sum_{i=1}^{n} P(x_{i} | v_{k}) \| x_{i} - v_{k} \|^{2} }{ \sum_{i=1}^{n} P(x_{i} | v_{k}) },   (4.24)

P(v_{k}) = \frac{ \sum_{i=1}^{n} P(x_{i} | v_{k}) }{ \sum_{l=1}^{c} \sum_{i=1}^{n} P(x_{i} | v_{l}) },   (4.25)

P_{N}(x_{i} | v_{k}) = \frac{1}{ (2\pi)^{p/2} \sigma_{k}^{p} } \, e^{ - \frac{ \| x_{i} - v_{k} \|^{2} }{ 2 \sigma_{k}^{2} } },   (4.26)

where p is the data dimension. In real datasets, for a cluster v_{k}, the data points usually come from different random distributions. Because they cluster at v_{k}, they tend to follow the normal distribution of v_{k} estimated as in (4.26). This idea is based on the Central Limit Theorem.

Theorem 4.1 (Central Limit Theorem): If S_{n} is the sum of n mutually independent random variables, then the distribution function of S_{n} is well approximated by a certain type of continuous function known as the normal density function, given by

f_{N}(x) = \frac{1}{ \sigma \sqrt{2\pi} } \, e^{ - \frac{ (x - \mu)^{2} }{ 2 \sigma^{2} } },   (4.27)

where \mu and \sigma are the mean and standard deviation of S_{n}, respectively.

We integrate the probabilities computed in (4.23) and (4.27) for the probability of data point x_{i} given cluster v_{k} as

P^{*}(x_{i} | v_{k}) = \max \{ P(x_{i} | v_{k}), P_{N}(x_{i} | v_{k}) \}.   (4.28)

Equation (4.28) better represents the data distribution, particularly in real datasets. Our method, fzBLE [147], is based on (4.21), (4.26) and (4.28).

Method testing

We tested fzBLE against other well-known cluster indices: partition coefficient - PC (3.23) and partition entropy - PE (3.24) [92], Fukuyama-Sugeno - FS (3.25) [93], Xie-Beni - XB (3.26) [94], Compose Within and Between scattering - CWB (3.28) [95], the Pakhira-Bandyopadhyay-Maulik cluster index - PBMF (3.29) [96], and the Babak Rezaee cluster index - BR (3.30) [97], using the artificial and real datasets. For each artificial dataset, we ran the standard FCM algorithm five times with m set to 2.0 and the partition matrix initialized randomly. In each case, the best fuzzy partition was then selected, and fzBLE and the other cluster indices were run to search for the optimal number of clusters over the range 2 to 12 and to compare it with the known number of clusters. We repeated the experiment 20 times and averaged the performance of each method. Table 4.8 shows the fraction of correct predictions.
fzBLE and PBMF outperform the other approaches. The CF method, which is based on the compactness factor, was the least effective.

Table 4.8: Fraction of correct cluster predictions on artificial datasets

#c  fzBLE  PC    PE    FS    XB    CWB   PBMF  BR    CF
3   1.00   0.42  0.42  0.42  0.42  1.00  1.00  0.83  0.00
4   1.00   0.92  0.92  0.92  0.83  1.00  1.00  1.00  0.00
5   1.00   0.75  0.75  0.83  0.75  0.83  1.00  1.00  0.00
6   1.00   0.92  0.83  0.92  0.58  0.58  1.00  0.92  0.00
7   1.00   0.83  0.83  0.83  0.67  0.58  1.00  0.67  0.00
8   1.00   1.00  0.92  1.00  0.92  0.67  1.00  0.83  0.00
9   1.00   0.92  0.67  0.92  0.67  0.33  1.00  0.83  0.00

Table 4.9: Validation method performance on the Iris dataset (3 true clusters)

#c  fzBLE      PC      PE      FS        XB      CWB        PBMF     BR      CF
2   -763.0965  0.9554  0.0977  -10.6467  0.0203  177.1838   12.3280  1.1910  0.9420
3   -762.8034  0.8522  0.2732  -9.3369   0.1292  213.4392   17.7131  1.0382  0.3632
4   -764.8687  0.7616  0.4381  -7.4821   0.2508  613.2656   14.4981  1.1344  0.2665
5   -770.2670  0.6930  0.5703  -8.2331   0.3473  783.4697   13.6101  1.0465  0.1977
6   -773.6223  0.6549  0.6702  -7.3202   0.2805  904.3365   12.3695  1.0612  0.1542
7   -774.4740  0.6155  0.7530  -6.8508   0.2245  1029.7342  11.2850  0.9246  0.1262
8   -774.8463  0.6000  0.8111  -6.9273   0.3546  1635.3593  10.5320  0.8692  0.1072
9   -780.1901  0.5865  0.8556  -6.6474   0.3147  1831.5705  9.9357   0.7653  0.0905
10  -781.7951  0.5765  0.8991  -6.0251   0.2829  2080.3339  9.3580   0.7076  0.0787

Tables 4.9-4.11 show the test results on the Iris, Wine and Glass datasets. For the Iris dataset, only fzBLE and PBMF detected the correct number of clusters (Table 4.9). For the Wine dataset, only fzBLE and CWB detected the correct number, and for the Glass dataset, only fzBLE did (Tables 4.10 and 4.11).
Table 4.10: Validation method performance on the Wine dataset (3 true clusters)

#c  fzBLE      PC      PE      FS         XB      CWB      PBMF    BR      CF
2   -926.4540  0.9264  0.1235  -113.0951  0.1786  3.9100   1.3996  2.0000  61.1350
3   -924.0916  0.8977  0.1764  -104.9060  0.2154  3.2981   0.9316  1.4199  39.3986
4   -932.8377  0.8607  0.2525  -139.9144  0.5295  6.6108   0.6306  1.1983  33.7059
5   -929.6146  0.8225  0.3281  -126.5746  0.5028  6.9001   0.4700  1.0401  28.4741
6   -928.8121  0.8066  0.3669  -118.4715  0.6173  9.2558   0.3706  0.9111  25.3451
7   -930.6451  0.7988  0.3874  -120.3128  0.6465  10.3803  0.2972  0.7629  23.1742
8   -932.0462  0.7993  0.3917  -124.7999  0.6459  11.0836  0.2471  0.6392  21.4411
9   -932.1902  0.7929  0.4120  -122.8396  0.6367  11.8373  0.2100  0.5801  19.9154
10  -935.0478  0.7909  0.4217  -130.9089  0.6270  11.9941  0.1773  0.5252  18.9891

Table 4.11: Validation method performance on the Glass dataset (6 true clusters)

#c  fzBLE       PC      PE      FS       XB      CWB         PBMF    BR      CF
2   -1135.6886  0.8884  0.1776  0.3700   0.7222  6538.9311   0.3732  1.9817  0.5782
3   -1127.6854  0.8386  0.2747  0.1081   0.7817  4410.3006   0.4821  1.5004  0.4150
4   -1119.2457  0.8625  0.2515  -0.0630  0.6917  3266.5876   0.4463  1.0455  0.3354
5   -1123.2826  0.8577  0.2698  -0.1978  0.6450  2878.8912   0.4610  0.8380  0.2818
6   -1113.8339  0.8004  0.3865  -0.2050  1.4944  5001.1752   0.3400  0.8371  0.2430
7   -1116.5724  0.8183  0.3650  -0.2834  1.3802  5109.6082   0.3891  0.6914  0.2214
8   -1127.2626  0.8190  0.3637  -0.3948  1.4904  7172.2250   0.6065  0.5916  0.2108
9   -1117.7484  0.8119  0.3925  -0.3583  1.7503  8148.7667   0.3225  0.5634  0.1887
10  -1122.1585  0.8161  0.3852  -0.4214  1.7821  9439.3785   0.3909  0.4926  0.1758
11  -1121.9848  0.8259  0.3689  -0.4305  1.6260  9826.4211   0.3265  0.4470  0.1704
12  -1135.0453  0.8325  0.3555  -0.5183  1.4213  11318.4879  0.5317  0.3949  0.1591
13  -1138.9462  0.8317  0.3556  -0.5816  1.4918  14316.7592  0.6243  0.3544  0.1472

For the Yeast dataset, we ran the FCM algorithm with m set to 1.17 [60] and used the resulting clustering partition to test all methods as in the previous sections.
Table 4.12 shows that only fzBLE detected the correct number of clusters (five) in the Yeast dataset.

Table 4.12: Validation method performance on the Yeast dataset (5 true clusters)

#c  fzBLE       PC      PE      FS         XB      CWB      PBMF    BR      CF
2   -2289.8269  0.9275  0.1172  -85.1435   0.2060  8.3660   1.2138  2.0000  133.0734
3   -2296.4502  0.9419  0.0983  -157.2825  0.2099  4.7637   0.6894  1.0470  94.6589
4   -2305.3369  0.9437  0.1000  -191.7664  0.2175  4.0639   0.5575  0.7240  74.7629
5   -2289.3070  0.9087  0.1648  -187.1073  1.0473  13.6838  0.4087  0.6722  65.9119
6   -2296.3098  0.8945  0.1939  -196.6711  0.9932  13.8624  0.3050  0.6170  60.8480
7   -2296.6017  0.8759  0.2299  -198.2858  1.0558  15.4911  0.2434  0.5686  56.1525
8   -2299.4225  0.8634  0.2526  -201.7688  1.0994  16.9644  0.2050  0.5132  51.2865
9   -2299.3653  0.8453  0.2871  -205.1489  1.2340  20.2532  0.1741  0.4819  48.0737
10  -2302.7581  0.8413  0.2992  -208.5687  1.1947  20.7818  0.1512  0.4533  45.9442
11  -2300.3294  0.8325  0.3186  -209.4023  1.1731  21.1525  0.1307  0.4272  43.6600
12  -2307.5701  0.8290  0.3272  -213.4658  1.2245  23.0389  0.1157  0.4040  42.1594
13  -2310.7819  0.8270  0.3354  -215.2463  1.3036  25.4062  0.1016  0.3847  40.8654

Table 4.13: Validation method performance on the Yeast-MIPS dataset (4 true clusters)

#c  fzBLE       PC      PE      FS        XB      CWB      PBMF    BR      CF
2   -1316.4936  0.9000  0.1625  25.4302   0.3527  16.7630  0.7155  1.9978  81.0848
3   -1317.3751  0.9092  0.1615  -32.8476  0.2981  10.1546  0.8032  1.2476  58.2557
4   -1304.0374  0.8216  0.3252  -39.4858  2.5297  39.8434  0.5400  1.3218  48.6275
5   -1308.6776  0.8279  0.3216  -54.4979  2.4245  34.9963  0.3620  0.9558  41.9671
6   -1309.9191  0.8211  0.3460  -59.8918  2.3511  35.4533  0.2691  0.8291  38.5468
7   -1315.3692  0.8139  0.3654  -65.4866  2.3562  38.8797  0.2423  0.7252  36.0906
8   -1315.1479  0.8062  0.3918  -67.6774  2.4958  43.9502  0.1966  0.6712  34.1387
9   -1321.2280  0.8109  0.3874  -72.3197  2.2854  41.2112  0.1664  0.6072  32.3289
10  -1324.1578  0.8158  0.3847  -74.7867  2.0433  37.6154  0.1395  0.5588  30.9686

For the Yeast-MIPS dataset, we ran the FCM algorithm using the same parameters as for the Yeast dataset.
The results in Table 4.13 show that only fzBLE correctly detected the four clusters in the Yeast-MIPS dataset. For the RCNS (Rat Central Nervous System) dataset, we ran fzBLE and the other cluster indices on the clustering partition found by the standard FCM algorithm using the Euclidean distance metric. Table 4.14 shows that, again, only fzBLE detected the correct number of clusters.

Table 4.14: Validation method performance on the RCNS dataset (6 true clusters)

#c  fzBLE      PC      PE      FS         XB      CWB      PBMF    BR      CF
2   -580.0728  0.9942  0.0121  -568.7972  0.0594  5.5107   4.2087  1.1107  177.8094
3   -564.1986  0.9430  0.0942  -487.6104  0.4877  4.1309   4.2839  1.6634  117.9632
4   -561.0169  0.9142  0.1470  -430.4863  0.9245  6.1224   3.3723  1.3184  99.1409
5   -561.7420  0.8900  0.1941  -397.0935  1.3006  9.4770   2.6071  1.1669  88.5963
6   -552.9153  0.8695  0.2387  -300.6564  2.5231  20.6496  1.9499  1.1026  84.0905
7   -556.2905  0.8707  0.2386  -468.3121  2.1422  21.0187  2.8692  0.7875  57.5159
8   -555.3507  0.8925  0.2078  -462.0673  1.7245  20.0113  2.5323  0.5894  52.0348
9   -558.8686  0.8863  0.2192  -512.4278  1.6208  22.4772  2.6041  0.5019  45.9214
10  -565.8360  0.8847  0.2241  -644.1451  1.1897  21.9932  3.4949  0.3918  33.1378

We have presented a novel method that uses a log-likelihood estimator with a Bayesian model and the possibility, rather than the probability, distribution model of the dataset derived from the fuzzy partition. By using the Central Limit Theorem, fzBLE effectively represents the distributions in real datasets. The results show that fzBLE performs effectively on both artificial and real datasets.

4.2.4 Fuzzy clustering evaluation using Gene Ontology [180]

The shortcoming of fzBLE is that it is based solely on the data and is therefore unable to incorporate prior biological knowledge into cluster evaluation.
We developed GOfzBLE, a cluster validation method that generates a fuzzy partition for a given clustering solution using GO term-based semantic similarity and then applies the fzBLE method to evaluate the clustering solution using that fuzzy partition.

Gene Ontology

The Gene Ontology (GO) [181] is a hierarchy of biological terms using a controlled vocabulary that includes three independent ontologies: biological process (BP), molecular function (MF) and cellular component (CC). Standardized terms known as GO terms describe the roles of genes and gene products in any organism. GO terms are related to each other through parent-child relationships. A gene product can have one or more molecular functions, can participate in one or more biological processes, and can be associated with one or more cellular components [182]. As a way to share knowledge about the functions of genes, GO itself does not contain the gene products of any organism. Rather, expert curators specializing in different organisms annotate the biological roles of gene products using GO annotations. Each GO annotation is assigned an evidence code that indicates the type of evidence supporting the annotation.

Semantic similarity

GO is structured as directed acyclic graphs (DAGs) in which the terms form nodes and the two kinds of semantic relationships, "is-a" and "part-of", form edges [184]. "is-a" is a simple class-subclass relation, where A is-a B means that A is a subclass of B. "part-of" is a partial ownership relation; C part-of D means that whenever C is present, it is always a part of D, but C need not always be present. The DAG structure allows a metric to be assigned to a set of terms based on the likeness of their meaning, which is used to measure the semantic similarity between terms.
Multiple GO-based semantic similarity measures have been developed [183, 184] and are increasingly used to evaluate the relationships between proteins in protein-protein interactions, or between co-regulated genes in gene expression data analysis. Among the existing semantic similarity measures, that of Resnik is most appropriate for gene expression data analysis because it correlates strongly with gene sequence similarities and gene expression profiles [183]. However, Wang et al. [184] have shown that Resnik's method has a drawback: it uses only the information content derived from annotation statistics, which is not suitable for measuring the semantic similarity of GO terms. We therefore propose to use Wang's method for GO semantic similarity measurement. For each term A, a semantic value, SV(A), is computed as

SV(A) = \sum_{t \in T_{A}} S_{A}(t),   (4.29)

where T_{A} is the set of terms consisting of the term A and its ancestors, and S_{A}(t) is the semantic value of term t with respect to A, defined as

S_{A}(t) = 1, if t = A; otherwise S_{A}(t) = \max \{ w_{t}^{u} \times S_{A}(u) : u \in ChildrenOf(t) \},   (4.30)

where w_{t}^{u} is the semantic contribution factor [184] for the edge connecting term t with its child term u. The semantic similarity between two terms, A and B, is defined as

S_{GO}(A,B) = \frac{ \sum_{t \in T_{A} \cap T_{B}} ( S_{A}(t) + S_{B}(t) ) }{ SV(A) + SV(B) }.   (4.31)

Using the semantic similarities between GO terms, the semantic similarity between two GO term sets, G1 and G2, is defined as

Sim(G1,G2) = \frac{ \sum_{g1 \in G1} Sim(g1, G2) + \sum_{g2 \in G2} Sim(g2, G1) }{ |G1| + |G2| },   (4.32)

where Sim(t,T) is the similarity between the term t and the term set T, defined as

Sim(t,T) = \max_{u \in T} \{ S_{GO}(t,u) \}.   (4.33)

Method design

For a given crisp clustering solution of gene expression data, an approximate fuzzy partition is generated based on the gene-gene semantic similarities computed from GO term annotations. The fuzzy partition is then applied in the fzBLE method to evaluate the clustering solution.
Each crisp clustering solution is modeled as C = {M, V}, where V represents the cluster centers and M = {m_ki}, m_ki ∈ {0,1}, k=1,...,c, i=1,...,n, is the crisp c-partition matrix representing the membership of the data points in the clusters. In the case where a fuzzy clustering solution Θ = {U, V} is given, C can be computed from Θ using our defuzzification method for fuzzy partitions [154].

Fuzzification of a crisp clustering partition

In order to use GO terms to validate a crisp clustering solution, a fuzzy partition is generated from the crisp partition using GO semantic similarity. This is done by mapping the crisp partition into the space of GO terms. Each data object x_i, i=1,...,n, corresponds to a vector of GO annotations, x_i^GO, and a vector of degrees of belief (DOB), x_i^CF, where each element of x_i^CF is the DOB of the corresponding term in x_i^GO annotated to x_i (Table 4.15). Because the GO annotations of a gene or gene product may come from different sources (Table 4.16), the use of DOBs helps when combining multiple annotations of the same data object that use the same term.
Table 4.15 : Degrees of belief of GO annotation evidence codes

Evidence codes             Degree of belief
EXP                        1.0
IDA, IPI, TAS              0.9
IMP, IGI, IEP              0.7
ISS, ISO, ISA, ISM         0.4
IGC                        0.2
IBA, IBD, IKR, IRD, RCA    0.3
NAS, IC, ND, NR            0.0
IEA                        0.1

The GO annotations of each cluster v_k, denoted v_k^GO, k=1,...,c, can be determined from the GO annotations of the members of cluster v_k:

v_k^GO = ∪_{x ∈ v_k} x^GO.    (4.34)

For each annotation v_kt^GO of v_k, t=1,...,|v_k^GO|, its DOB, v_kt^CF, is computed from the DOBs of the cluster members annotated with term t:

v_kt^CF = mean(x_t^CF : x ∈ v_k, x_t^GO = v_kt^GO).    (4.35)

Table 4.16 : Gene Ontology evidence codes

Experimental Evidence Codes
  EXP  Inferred from Experiment
  IDA  Inferred from Direct Assay
  IPI  Inferred from Physical Interaction
  IMP  Inferred from Mutant Phenotype
  IGI  Inferred from Genetic Interaction
  IEP  Inferred from Expression Pattern
Computational Analysis Evidence Codes
  ISS  Inferred from Sequence or Structural Similarity
  ISO  Inferred from Sequence Orthology
  ISA  Inferred from Sequence Alignment
  ISM  Inferred from Sequence Model
  IGC  Inferred from Genomic Context
  IBA  Inferred from Biological aspect of Ancestor
  IBD  Inferred from Biological aspect of Descendant
  IKR  Inferred from Key Residues
  IRD  Inferred from Rapid Divergence
  RCA  Inferred from Reviewed Computational Analysis
Author Statement Evidence Codes
  TAS  Traceable Author Statement
  NAS  Non-traceable Author Statement
Curator Statement Evidence Codes
  IC   Inferred by Curator
  ND   No biological Data available
Automatically-assigned Evidence Codes
  IEA  Inferred from Electronic Annotation
Obsolete Evidence Codes
  NR   Not Recorded

We apply the semantic similarity measure in (4.32) to compute the semantic similarity between cluster v_k and data point x_i using their GO annotations. We propose a modification of (4.31) so that it can be used with DOBs.
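The per-cluster aggregation in (4.34) and (4.35) can be sketched as follows. The gene names and their annotations are invented for the example; the DOB values are a subset of Table 4.15:

```python
from statistics import mean

# Evidence-code degrees of belief (subset of Table 4.15).
DOB = {"EXP": 1.0, "IDA": 0.9, "IMP": 0.7, "ISS": 0.4, "IEA": 0.1}

# Hypothetical cluster members: gene -> {GO term: evidence code}.
members = {
    "gene1": {"GO:0006915": "EXP", "GO:0008219": "IEA"},
    "gene2": {"GO:0006915": "IDA"},
    "gene3": {"GO:0008219": "IMP"},
}

def cluster_annotations(members):
    """(4.34)/(4.35): the cluster's GO terms are the union of its members'
    terms, and each term's DOB is the mean DOB over the members that use it."""
    per_term = {}
    for annots in members.values():
        for term, code in annots.items():
            per_term.setdefault(term, []).append(DOB[code])
    return {term: mean(vals) for term, vals in per_term.items()}

print(cluster_annotations(members))
# GO:0006915 -> mean(1.0, 0.9) = 0.95 ; GO:0008219 -> mean(0.1, 0.7) ≈ 0.4
```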
For the term v_kt^GO of cluster v_k and the term x_ij^GO of data point x_i, their semantic similarity is defined as:

S'_GO(v_kt^GO, x_ij^GO) = S_GO(v_kt^GO, x_ij^GO) × min(v_kt^CF, x_ij^CF).    (4.36)

We use (4.36) in place of (4.31) when computing the semantic similarity between v_k and x_i, Sim'(v_k, x_i), using (4.32). The distance between v_k and x_i based on GO semantic similarity is then defined as:

d_GO^2(v_k, x_i) = d^2(1, Sim'(v_k, x_i)).    (4.37)

A fuzzy partition model in the GO term space, Θ_GO = {U_GO, ·}, of C is now determined, where U_GO is computed as in (4.9) using the distance function defined in (4.37). Θ_GO is used with fzBLE to evaluate the clustering solution. Because Θ_GO is based on GO annotations, we propose a method to compute the prior probability of v_k, P_p(v_k), k=1,...,c, using the prior probabilities of the GO terms T used by v_k. For a given term t ∈ T,

P(t | v_k) = sup(v_kt^CF) / sum(v_k^CF),    (4.39)

where P_p(t) and P(t | v_k) are the prior probability of the term t and the conditional probability of the term t given v_k, respectively. The prior probability of v_k, k=1,...,c, is computed as in (4.40):

P_p(v_k) = Σ_{t ∈ T} P(t, v_k) × P_p(t) = Σ_{t ∈ T} P(t | v_k) × P(v_k) × P_p(t).    (4.40)

Method testing

We tested GOfzBLE against fzBLE and the other cluster indices described in Section 4.2.3, using the Yeast, Yeast-MIPS and RCNS datasets.

Table 4.17 : Validation method performance on the Yeast dataset using GO-BP

#c  GOfzBLE    PC     PE     FS        XB     CWB     PBMF   BR     fzBLE
2   -4832.656  0.928  0.117  -85.144   0.206  8.366   1.214  2.000  -2289.827
3   -4832.542  0.942  0.098  -157.283  0.210  4.764   0.689  1.047  -2296.450
4   -4832.751  0.944  0.100  -191.766  0.218  4.064   0.558  0.724  -2305.337
5   -4831.548  0.909  0.165  -187.107  1.047  13.684  0.409  0.672  -2289.307
6   -4831.890  0.895  0.194  -196.671  0.993  13.862  0.305  0.617  -2296.310
7   -4833.182  0.876  0.230  -198.286  1.056  15.491  0.243  0.569  -2296.602
8   -4833.644  0.863  0.253  -201.769  1.099  16.964  0.205  0.513  -2299.423
9   -4832.358  0.845  0.287  -205.149  1.234  20.253  0.174  0.482  -2299.365
10  -4832.246  0.841  0.299  -208.569  1.195  20.782  0.151  0.453  -2302.758
11  -4832.600  0.833  0.319  -209.402  1.173  21.153  0.131  0.427  -2300.329
12  -4832.668  0.829  0.327  -213.466  1.225  23.039  0.116  0.404  -2307.570
13  -4832.217  0.827  0.335  -215.246  1.304  25.406  0.102  0.385  -2310.782

For the Yeast dataset, only GOfzBLE and fzBLE correctly identified the number of clusters (Table 4.17). Table 4.18 again shows that these two algorithms correctly detected the number of clusters. While fzBLE works from gene expression levels, GOfzBLE works from GO-BP annotations. The results also show that GO-BP annotations are strongly correlated with gene-gene differential co-expression patterns. In other words, either gene-gene co-expression patterns in gene expression data or GO-BP annotations may be used to search for genes with similar functionality.
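The set-level similarity of (4.32)-(4.33), with each term pair down-weighted by the smaller DOB as in (4.36), can be sketched as follows. The pairwise S_GO values and term names here are invented placeholders rather than outputs of Wang's measure:

```python
# Stand-in pairwise term similarities S_GO (illustrative values only;
# in GOfzBLE these would come from Wang's measure).
S_GO = {
    ("t1", "t1"): 1.0, ("t1", "t2"): 0.6, ("t1", "t3"): 0.2,
    ("t2", "t2"): 1.0, ("t2", "t3"): 0.5, ("t3", "t3"): 1.0,
}

def s_go(a, b):
    # Similarity table is symmetric; look up either orientation.
    return S_GO.get((a, b)) or S_GO.get((b, a), 0.0)

def sim_term_set(t, dob_t, group):
    """Sim(t, T) per (4.33), with each pairwise similarity scaled by the
    smaller of the two DOBs per (4.36)."""
    return max(s_go(t, u) * min(dob_t, dob_u) for u, dob_u in group.items())

def sim_sets(g1, g2):
    """Sim(G1, G2) per (4.32); g1 and g2 map GO term -> DOB."""
    total = sum(sim_term_set(t, d, g2) for t, d in g1.items()) \
          + sum(sim_term_set(t, d, g1) for t, d in g2.items())
    return total / (len(g1) + len(g2))

g1 = {"t1": 0.9, "t2": 0.7}   # e.g. a cluster's aggregated annotations
g2 = {"t2": 1.0, "t3": 0.4}   # e.g. one gene's annotations
print(round(sim_sets(g1, g2), 3))  # (0.54 + 0.7 + 0.7 + 0.2) / 4 = 0.535
```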
Table 4.18 : Validation method performance on the Yeast-MIPS dataset using GO-BP

#c  GOfzBLE    PC     PE     FS       XB     CWB     PBMF   BR     fzBLE
2   -2288.413  0.900  0.163  25.430   0.353  16.763  0.716  1.998  -1316.494
3   -2286.843  0.909  0.162  -32.848  0.298  10.155  0.803  1.248  -1317.375
4   -2283.854  0.822  0.325  -39.486  2.530  39.843  0.540  1.322  -1304.037
5   -2285.069  0.828  0.322  -54.498  2.425  34.996  0.362  0.956  -1308.678
6   -2286.252  0.821  0.346  -59.892  2.351  35.453  0.269  0.829  -1309.919
7   -2286.834  0.814  0.365  -65.487  2.356  38.880  0.242  0.725  -1315.369
8   -2287.543  0.806  0.392  -67.677  2.496  43.950  0.197  0.671  -1315.148
9   -2288.333  0.811  0.387  -72.320  2.285  41.211  0.166  0.607  -1321.228
10  -2288.954  0.816  0.385  -74.787  2.043  37.615  0.140  0.559  -1324.158

Table 4.19 : Validation method performance on the RCNS dataset using GO-CC

#c  GOfzBLE   PC     PE     FS        XB     CWB     PBMF   BR     fzBLE
2   -970.340  0.994  0.012  -568.797  0.059  5.511   4.209  1.111  -580.073
3   -970.460  0.943  0.094  -487.610  0.488  4.131   4.284  1.663  -564.199
4   -969.909  0.914  0.147  -430.486  0.925  6.122   3.372  1.318  -561.017
5   -969.669  0.890  0.194  -397.094  1.301  9.477   2.607  1.167  -561.742
6   -969.659  0.870  0.239  -300.656  2.523  20.650  1.950  1.103  -552.915
7   -970.148  0.871  0.239  -468.312  2.142  21.019  2.869  0.788  -556.291
8   -969.787  0.893  0.208  -462.067  1.725  20.011  2.532  0.589  -555.351
9   -970.300  0.886  0.219  -512.428  1.621  22.477  2.604  0.502  -558.869
10  -970.487  0.885  0.224  -644.145  1.190  21.993  3.495  0.392  -565.836

The RCNS dataset has six groups of similarly expressed genes, two of which are invariant [87]. We first ran GOfzBLE using GO-CC annotations and compared it with the other methods. The results are shown in Table 4.19. Only GOfzBLE and fzBLE identified six clusters in the dataset, corresponding to the six groups in the analysis performed by Wen et al. [87]. We then reran GOfzBLE using GO-BP annotations and compared it with the other methods. The results are shown in Table 4.20.
GOfzBLE identified four clusters in the dataset, corresponding to the four stages of the developmental process. This result again shows that GO-BP annotations are strongly correlated with gene-gene differential co-expression patterns. It also shows that GO-BP annotations are useful for creating effective external criteria for cluster analysis of gene expression data.

Table 4.20 : Validation method performance on the RCNS dataset using GO-BP

#c  GOfzBLE    PC     PE     FS        XB     CWB     PBMF   BR     fzBLE
2   -1373.999  0.994  0.012  -568.797  0.059  5.511   4.209  1.111  -580.073
3   -1373.935  0.943  0.094  -487.610  0.488  4.131   4.284  1.663  -564.199
4   -1373.776  0.914  0.147  -430.486  0.925  6.122   3.372  1.318  -561.017
5   -1374.208  0.890  0.194  -397.094  1.301  9.477   2.607  1.167  -561.742
6   -1374.496  0.870  0.239  -300.656  2.523  20.650  1.950  1.103  -552.915
7   -1374.526  0.871  0.239  -468.312  2.142  21.019  2.869  0.788  -556.291
8   -1374.811  0.893  0.208  -462.067  1.725  20.011  2.532  0.589  -555.351
9   -1375.199  0.886  0.219  -512.428  1.621  22.477  2.604  0.502  -558.869
10  -1375.426  0.885  0.224  -644.145  1.190  21.993  3.495  0.392  -565.836

4.2.5 Imputation methods for partially missing data

As discussed in Section 3.4, given a data point x_i with missing attributes, the global model based approach updates the missing values of x_i using information from the whole dataset, while the local model based approach uses only the information from the data points in the neighborhood of x_i. The former has the advantage of eliminating noise that can be introduced by the data points around x_i, particularly when x_i lies in an overlapping region. The latter, however, has the advantage that it may better capture the local distribution model that x_i follows. Recent research has shown that the global model based approach, of which the optimal completion strategy (OCS) is an example, outperforms the local model based one [139, 140, 143].
Existing approaches usually update X_M using either a global model or a local model. Luo et al. [141], using a local update model integrated into the FCM algorithm, showed that their method outperformed two well-known methods: K-nearest neighbor (KNN) and local least squares K-nearest neighbor (SKNN) imputation. However, the method of Luo et al. is based only on local model information; hence, it may fail in overlapping regions. Mohammadi et al. [142] improved the work of Luo et al. by using an external distance based on GO terms. However, because the relationships between terms in GO are detected locally, methods using them remain a kind of local model.

To address the problem of missing data, we developed two new algorithms based on our fzBLE and fzSC methods: a probability based (fzPBI) and a density based (fzDBI) imputation algorithm. Both create an "abstract" fuzzy partition from the partition derived using the partial distance strategy (PDS) method [140]. These abstract partitions not only capture the global distribution model of the data, but also represent the local distributions at every data point in the dataset. We then developed new update models for X_M using both the global and local data distribution models derived from the abstract partitions [154, 155].

4.2.6 Probability based imputation method (fzPBI) [154]

Method design

The objective of fzPBI is to cluster a dataset X, which may be incomplete, into c clusters; the missing values X_M are estimated during the clustering process with respect to the optimization of the objective function J_m. Let X_W = {x_i ∈ X : x_i is a complete data object}, X_P = {x_ij : x_ij ≠ ?}, and X_M = {x_ij : x_ij = ?}. The distance measure d(.) is therefore defined as:

d^2(x_i, v_k) = Σ_{j=1}^p w_j × (x_ij − v_kj)^2,    (4.41)

where w_j indicates the contribution degree of the j-th attribute of x_i, x_ij, to the distance between x_i and v_k. If x_ij ∈ X_M, w_j increases with each iteration to avoid premature use of estimated values. We therefore define w_j as:

w_j = 1, if x_ij ∈ X_P;
w_j = t/T, if x_ij ∈ X_M,    (4.42)

where t is the iteration index and T is the maximum number of iterations, 0 ≤ t ≤ T. To impute the values of X_M, we applied our fzBLE method [148] to generate a probabilistic model of the data distribution using the fuzzy partition. This model is then used to impute the missing values. For each cluster v_k, k=1,...,c, a probability distribution {p_ki}, i=1,...,n, is derived from the possibility distribution {u_ki}, i=1,...,n, as in (4.22) and (4.23). The data distribution model is then estimated using (4.24), (4.25) and (4.26). The model for missing data estimation is then developed as follows:

Let P_k^M = {p_k1^M, p_k2^M, ..., p_kp^M} be a set of probabilities at v_k, where p_kj^M, j=1,...,p, indicates the probability that attribute j is missing in cluster k, defined as:

p_kj^M = Σ_{i=1}^n δ_ij × p_ki / Σ_{i=1}^n p_ki,    (4.43)

where δ_ij = 0 if x_ij ∈ X_P, and δ_ij = 1 if x_ij ∈ X_M. Hence, the probability that a data object x_i has a missing attribute j in cluster k is:

P(x_M(ij) | k) = P(v_k) × p_kj^M × Π_{t=1, t≠j}^p P(x_it | v_k).    (4.44)

Because each cluster v_k can be considered a component of the data distribution model of X, the estimated value of x_ij ∈ X_M is computed as:

x̂_ij = Σ_{k=1}^c P(x_M(ij) | k) × v_kj / Σ_{k=1}^c P(x_M(ij) | k).    (4.45)

To avoid oscillation in the missing value imputation, estimated values are used decreasingly. In contrast to their usage in the clustering process, their contribution degree in the distance measure during the estimation process is defined as:

w_j = 1, if x_ij ∈ X_P;
w_j = 1 − t/T, if x_ij ∈ X_M.    (4.46)

fzPBI algorithm

Steps:
1) t = 0; initialize U^t randomly with respect to (3.2).
2) Compute {d^2(x_i, v_k)}, i=1,...,n, k=1,...,c, as in (4.41) and (4.42).
3) Compute U^{t+1} using (3.3).
4) Compute V^{t+1} using (3.4).
5) If (t > T) or (||J_m^{t+1} − J_m^t|| < ε), then stop.
6) Compute {d^2(x_i, v_k)}, i=1,...,n, k=1,...,c, using w_j as in (4.46).
7) Create a probabilistic data distribution model from the fuzzy partition using (4.24), (4.25) and (4.26).
8) Create a probabilistic model for X_M using (4.43) and (4.44).
9) Estimate X_M using (4.45); t = t + 1.
Go to Step 2.

The difference between fzPBI and the standard FCM algorithm is that fzPBI provides a method to discover the probabilistic model of the data distribution from the fuzzy partition and to apply that model to missing value imputation. fzPBI can therefore address the problem of missing data by itself.

Algorithm performance measures

We used two measures to evaluate algorithm performance. The first is the root mean square error (RMSE) between the true values and the imputed values, defined as:

RMSE = sqrt( (1/n_I) × Σ (x̂_ij − x_ij)^2 ),    (4.47)

where n_I is the number of missing values imputed. The second measure assesses overall performance by the number of data objects with missing attributes that were misclassified. This assessment is done by comparing the cluster label of each data object with its actual class label; if the two match, there is no misclassification, otherwise a misclassification has occurred, as defined in (3.21).

Method testing

Figure 4-6 : Average RMSE of 50 trials using an incomplete ASET2 dataset with different percentages of missing values

To evaluate the performance of fzPBI, we compared it with six popular imputation methods: PDS, OCS and NPS [140]; FCMimp [158]; CIAO [159]; and FCMGOimp [160]. For each test dataset, we generated missing data using different percentages of missing values. The fuzzifier factor, m, was set to 2.0 for the artificial datasets and the Iris, Wine and RCNS datasets. For the Yeast, Yeast-MIPS and Serum datasets, m was set to 1.17 and 1.25, respectively, as in [60]. The number of clusters, c, was set to the known number of clusters. Each algorithm was run 5 times and the best result recorded. We repeated the experiment 50 times and averaged the performance of each algorithm using both measures.
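The partial distance measure of (4.41), with the iteration-dependent weights of (4.42) for the clustering phase and (4.46) for the estimation phase, can be sketched as follows (the attribute values are invented; in fzPBI the missing entry would hold the current estimate):

```python
def partial_distance(x, v, missing, t, T, imputing=False):
    """Squared weighted distance d^2(x_i, v_k) per (4.41). Each attribute's
    squared difference is scaled by w_j: 1 for observed attributes, and for
    missing ones t/T during clustering (4.42) -- estimates phased in -- or
    1 - t/T during estimation (4.46) -- estimates phased out."""
    d2 = 0.0
    for xj, vj, is_missing in zip(x, v, missing):
        if is_missing:
            w = (1.0 - t / T) if imputing else (t / T)
        else:
            w = 1.0
        d2 += w * (xj - vj) ** 2
    return d2

x = [1.0, 2.0, 0.0]            # third attribute holds a current estimate
v = [1.0, 1.0, 1.0]            # a cluster center
miss = [False, False, True]
print(partial_distance(x, v, miss, t=2, T=10))                 # 1 + 0.2*1 = 1.2
print(partial_distance(x, v, miss, t=2, T=10, imputing=True))  # 1 + 0.8*1 = 1.8
```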
Figure 4-7 : Average RMSE of 50 trials using an incomplete ASET5 dataset with different percentages of missing values

Regarding the artificial datasets: ASET2 has five well-separated clusters of similar size, while ASET5 is more complex, containing three clusters that differ in size and density (Figure 3-16). On ASET2, fzPBI had the lowest RMSEs across different scales of missing data (Figure 4-6) and therefore performed better than the other methods. On ASET5, fzPBI again performed best (Figure 4-7), although the performance of the other methods was comparable.

Applied to the Iris dataset, fzPBI and CIAO had the smallest RMSEs, although, as shown in Figure 4-8, fzPBI performed marginally better. Table 4.21 shows that, compared with CIAO, fzPBI had a smaller number of misclassified objects across different scales of missing values.

Figure 4-8 : Average RMSE of 50 trials using an incomplete Iris dataset with different percentages of missing values

On the Wine dataset, on average, both fzPBI and CIAO outperformed the other methods. fzPBI and CIAO performed equally on data with missing value percentages of 10% and lower; however, fzPBI performed slightly better than CIAO at higher percentages of missing values. FCMimp and FCMGOimp performed better than the other methods when the percentage of missing values was small (Figure 4-9). However, they performed worse than all the other methods on datasets with high percentages of missing values, i.e., greater than 15%.
Table 4.21 : Average results of 50 trials using an incomplete IRIS dataset with different percentages (%) of missing values (averaged number of objects misclassified)

%   fzPBI  PDS   OCS   NPS   FCMimp  CIAO  FCMGOimp
5   15.9   18.7  16.0  16.2  18.4    15.9  18.0
7   15.9   12.6  16.3  16.8  13.0    15.9  13.2
10  15.9   10.0  16.3  17.7  9.8     16.0  9.5
13  16.0   9.1   17.1  19.0  9.5     16.2  9.0
15  16.1   12.8  18.4  20.7  12.2    16.3  9.9
20  15.9   20.6  19.8  22.9  20.3    16.0  23.7
25  16.2   31.9  20.5  24.8  30.8    16.1  32.8
30  16.7   37.9  26.2  30.9  37.9    16.7  38.9
40  16.7   49.3  30.8  37.9  50.8    17.0  57.8
50  21.2   56.5  42.6  52.3  57.1    21.8  64.1

Figure 4-9 : Average RMSE of 50 trials using an incomplete Wine dataset with different percentages of missing values

Figure 4-10 : Average RMSE of 50 trials using an incomplete RCNS dataset with different percentages of missing values

On the RCNS dataset (Figure 4-10), fzPBI outperformed the other algorithms when the missing value percentage was 40% or lower, but performed marginally worse at a missing value percentage of 50%. This is because the RCNS dataset contains only 112 data points and becomes very sparse at high percentages of missing values; the data probability model could not be properly determined, and fzPBI therefore could not perform correctly.
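The RMSE criterion of (4.47) over the imputed entries reduces to a few lines (the values below are invented for illustration):

```python
def rmse(true_vals, imputed_vals):
    """Root mean square error between true and imputed entries, per (4.47)."""
    n_i = len(true_vals)  # number of missing values imputed
    return (sum((t - e) ** 2 for t, e in zip(true_vals, imputed_vals)) / n_i) ** 0.5

print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.5, 2.5]), 4))  # sqrt(0.5/3) ≈ 0.4082
```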
Figure 4-11 : Average RMSE of 50 trials using an incomplete Yeast dataset with different percentages of missing values

fzPBI outperformed the other methods on the Yeast and Yeast-MIPS datasets (Figures 4-11 and 4-12). FCMGOimp was run using the GO term based distance measure [11]; however, it outperformed only FCMimp, a result similar to that reported in [158]. Using GO terms to measure the distance between genes is interesting; however, a crisp distance measure over {0,1}, where 0 is the distance between a pair of genes having at least one GO term in common, does not help much with the problem of missing value imputation. fzPBI again outperformed the other algorithms on the Serum dataset for missing value percentages lower than 50% (Figure 4-13); in the remaining cases, fzPBI was outperformed only by CIAO. On average, fzPBI performed best on the Serum dataset.

Figure 4-12 : Average RMSE of 50 trials using an incomplete Yeast-MIPS dataset with different percentages of missing values

Figure 4-13 : Average RMSE of 50 trials using an incomplete Serum dataset with different percentages of missing values

4.2.7 Density based imputation method (fzDBI) [155]

Similar to fzPBI, the fzDBI algorithm was developed to cluster a dataset X, which may be incomplete, into c clusters based on a fuzzy partition generated using the partial distance measure as in (4.41) and (4.42).
The abstract fuzzy partition is derived from the FCM fuzzy partition using our fzSC method, as follows:

Let Θ = {U, V} be the fuzzy partition generated by FCM, which is a distance based fuzzy partition describing the data density at the cluster centers. The accumulated density at the center of cluster v_k, k=1,...,c, is calculated as in (4.11). However, Θ cannot properly describe the data density at every data point of X (Section 4.2.2). A strong uniform fuzzy partition is therefore defined as in (4.15) and (4.16), and the data density at every data point is then computed using (4.17). A set V̄ = {v̄_1, v̄_2, ..., v̄_{c_d}} of the c_d most dense data points is then selected as cluster candidates. Because V̄ represents the data densities in the dataset, we construct a density based fuzzy partition Θ̄ = {Ū, V̄}, where Ū is defined as:

ū_ki = Σ_{l=1}^{c_d} P(v̄_l | v_k) × u_li / Σ_{l=1}^{c_d} P(v̄_l | v_k).    (4.48)

The estimated value of x_ij ∈ X_M is computed from both the FCM fuzzy partition, Θ, and the density based fuzzy partition, Θ̄:

x̂_ij = α × Σ_{k=1}^c u_ki^m v_kj / Σ_{k=1}^c u_ki^m + (1 − α) × Σ_{k=1}^{c_d} ū_ki^m v̄_kj / Σ_{k=1}^{c_d} ū_ki^m,    (4.49)

where α, 0 ≤ α ≤ 1, indicates the contribution of each of the two fuzzy partition models, Θ and Θ̄, to the missing value imputation. We set α = 0.5 so that the two models contribute equally. To avoid oscillation in the missing value imputation, estimated values are used decreasingly; in contrast to their usage in the clustering process, their contribution degree in the distance measure during the estimation process is defined as in (4.46).

fzDBI algorithm

Steps:
1) t = 0; c = √n.
2) Initialize U^t randomly with respect to (3.2).
3) Compute {d^2(x_i, v_k)}, i=1,...,n, k=1,...,c, as in (4.41) and (4.42).
4) Compute U^{t+1} using (3.3).
5) Compute V^{t+1} using (3.4).
6) If (t > T) or (||J_m^{t+1} − J_m^t|| < ε), then stop.
7) Compute {d^2(x_i, v_k)}, i=1,...,n, k=1,...,c, using w_j as in (4.46).
8) Create a strong uniform fuzzy partition as in (4.15) and (4.16).
9) Create a density based fuzzy partition of X, Θ̄, as in (4.48).
10) Estimate X_M using (4.49); t = t + 1. Go to Step 3.

The difference between fzDBI and the standard FCM algorithm is that, in addition to a distance based fuzzy partition, fzDBI generates a density based fuzzy partition, which is used together with the distance based one in missing value imputation.

Method testing

fzDBI was tested in the same way as fzPBI and compared with the six popular imputation methods: PDS, OCS and NPS [140]; FCMimp [158]; CIAO [159]; and FCMGOimp [160]. fzDBI performed better on datasets of different difficulty levels (Figures 4-14 to 4-17).

Figure 4-14 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET2 dataset with different missing value percentages

Figure 4-15 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET5 dataset with different missing value percentages

Figure 4-16 : (fzDBI) Average RMSE of 50 trials using an incomplete Iris dataset with different percentages of missing values

Figure 4-17 : (fzDBI) Average RMSE of 50 trials using an incomplete Yeast-MIPS dataset with different percentages of missing values

4.2.8 Probability based defuzzification (fzPBD) [156]

To generate classification information from a fuzzy partition, the fuzzy partition needs to be defuzzified for crisp classification information of the data
objects. A popular approach to this problem is the maximum membership degree principle (MMD). This may be inappropriate in some applications, particularly where the data are non-uniform, because FCM membership is computed from the distance between the data object and the cluster center; relying on membership degree alone can assign marginal objects of a large cluster to an immediately adjacent small cluster. Recent solutions include that of Chuang et al. [150], who proposed using spatial information to adjust the membership status of every data point based on the membership status of its neighbors. In a similar approach, Chiu et al. [149] used spatial information to determine the distance between the data point and the cluster center. However, these methods require the definition of neighborhood boundaries. Genther et al. [151] proposed deriving spatial information from the clustering structure, rather than the neighborhood, to compute the distance from the data point to the cluster center. A common limitation of methods that use spatial information in computing membership degrees is that they must scale between the actual position information of a data point and its spatial information. In addition, while spatial information is appropriate for image segmentation, it may not work for generic cluster analysis because neighborhood boundaries are difficult to define.

We developed fzPBD, which uses the fuzzy partition and the data themselves to construct a probabilistic model of the data distributions. The model is then applied to produce the classification information of the data points in the dataset.

Method design

We applied the statistical model from our fzBLE method [141] to generate a probabilistic model of the data distribution using the fuzzy partition. Given a fuzzy partition matrix U, the vector U_k = {u_ki}, i=1,...,n, k=1,...,c, is a possibility model of the data distribution of v_k on X.
We associate probability distributions {P_k} with the possibility distributions {U_k} of v_k on X, k=1,...,c, using (4.22) and (4.23). The statistics at v_k, k=1,...,c, are then derived using (4.24), (4.25) and (4.26). Following the Central Limit Theorem, which states that the distribution of an average tends to be normal even when the distribution from which the average is computed is decidedly non-normal, the data objects can be assumed to follow normal distributions at {v_k}, k=1,...,c, defined as in (4.26). Therefore, the data object x_i is assigned to the class of v_k, where

P(v_k | x_i) = max_{l=1,...,c} {P(v_l | x_i)}.    (4.50)

Because P(v_k | x_i) = P(x_i, v_k)/P(x_i) = P(x_i | v_k) × P(v_k)/P(x_i), an alternative to (4.50) is

P(v_k | x_i) = max_{l=1,...,c} {P(x_i | v_l) × P(v_l)},    (4.51)

where P(v_l), l=1,...,c, the prior probability of v_l, can be computed using (4.25).

fzPBD algorithm

Steps:
1) Convert the possibility distributions in U into probability distributions using (4.22) and (4.23).
2) Construct a probabilistic model of the data distributions using (4.24) and (4.26).
3) Apply the model to produce the classification information for every data point using (4.51).

Method testing

We compared fzPBD with the MMD, FCB, NBM and sFCM methods on ASET2, ASET3, ASET4 and ASET5, using the compactness measure (3.43) and the misclassification measure (3.21) to evaluate algorithm performance. For each dataset, the following values of the fuzzifier factor m were used: 1.25, 1.375, 1.50, 1.625, 1.75, 1.875 and 2.0. The number of clusters, c, was set to the known number of clusters. The FCM algorithm was run 3 times and the best fuzzy cluster partition was selected to test all the algorithms. We repeated the experiment 100 times and averaged the performance of each algorithm using the two measures.
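The decision rule (4.51) can be sketched in one dimension with invented cluster statistics. Note how a point nearer the small, tight cluster's center can still be claimed by the large cluster once priors and spreads enter the score, which is exactly where a distance-only MMD assignment and a probabilistic assignment differ:

```python
import math

def gauss(x, mu, sigma):
    """1-D normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def defuzzify(points, clusters):
    """Assign each point to argmax_l P(x | v_l) * P(v_l), per (4.51).
    `clusters` is a list of (prior, mean, std) tuples -- a 1-D stand-in for
    the per-cluster normal model fzPBD builds from the fuzzy partition."""
    labels = []
    for x in points:
        scores = [prior * gauss(x, mu, sd) for prior, mu, sd in clusters]
        labels.append(scores.index(max(scores)))
    return labels

# A large spread-out cluster and a small tight one (invented statistics).
clusters = [(0.8, 0.0, 2.0), (0.2, 4.0, 0.5)]
# The point at 3.0 is nearer the small cluster's center (distance 1.0 vs 3.0)
# yet is assigned to the large cluster by the probabilistic rule.
print(defuzzify([0.1, 3.0, 4.2], clusters))  # [0, 0, 1]
```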
Figure 4-18 : Algorithm performance on ASET2

Figure 4-19 : Algorithm performance on ASET3

ASET2 and ASET3 each contain five well-separated clusters of the same sizes, in two-dimensional and three-dimensional data spaces, respectively. The performance of all algorithms is shown in Figures 4-18 and 4-19. fzPBD, MMD and sFCM generated no misclassifications across multiple levels of m.
yd $Kr • • • • *• —JK— -x- ••••X I 1.25 1.375 1.5 1.625 Fuzzifier (m) 1.75 1.875 2 Figure 4-21 : Algorithm performance on ASET5 7.5 c 6.5 £ 6 ■o 5.5 a 5 E 4 -5 §3.5 2.5 -■— fzPBD -♦ -MMD •X--- FCB ■ -NBM sFCM 1.25 1.375 1.5 1.625 1.75 Fuzzifier (m) 1.875 Figure 4-22 : Algorithm performance on IRIS The IRIS dataset contains three clusters corresponding to the three classes of Iris flowers. Results are shown in Figure 4-22. NBM outperformed other algorithms. fzPBD 158 performed lightly less well, however, Table 4.23 shows that, compared with NBM, fzPBD had smaller compactness measures across multiple levels of m. Table 4.23 : IRIS: Compactness measure Algo. Fuzzifier factor m 1.25 1.375 1.5 1.625 1.75 1.875 2.00 fzPBD 0.44 0.44 0.44 0.44 0.44 0.44 0.44 MMD 0.44 0.44 0.44 0.44 0.44 0.44 0.44 FCB 11.84 11.81 11.78 11.76 11.75 11.75 11.75 NBM 1.75 1.75 1.75 1.74 1.74 1.74 1.74 sFCM 0.44 0.44 0.44 0.44 0.44 0.44 0.44 J2 56 - c o 4 _ ^* , ♦ >^ — ♦- — -> — — <r~ " ro 51 ■a ■o 1 46 - - ---+-- --^h- h- _ - ^ v — ■ fzPBD — ♦ -MMD u E ...*••• FCB *o 41 NBM sFCM E 1 6^ ...X---- X x.. # z 36 - ''% -X 31 - 1.25 1.375 1.5 1.625 Fuzzifier (m) 1.75 1.875 2 Figure 4-23 : Algorithm performance on WINE dataset The Wine dataset contains information on 13 attributes of the three classes of wines. Results are shown in Figure 4-23. fzPBD and FCB performed better than other algorithms. fzPBD performed better than FCB with the fuzzifier factor levels less than 159 1.8. Also, Table 4.24 shows that fzPBD performed better than FCB across multiple levels of m. Table 4.24 : WINE- Compactness measure Algo. 
Algo.    m=1.25   1.375    1.5      1.625    1.75     1.875    2.00
fzPBD    0.08     0.08     0.08     0.08     0.08     0.08     0.08
MMD      0.05     0.05     0.05     0.05     0.05     0.05     0.06
FCB      0.19     0.18     0.18     0.17     0.16     0.16     0.17
NBM      0.17     0.17     0.17     0.18     0.18     0.18     0.18
sFCM     0.05     0.05     0.05     0.05     0.05     0.05     0.05

Figure 4-24 : Algorithm performance on GLASS dataset

The Glass dataset contains information on nine attributes of six classes of glass used in building construction. Results are shown in Figure 4-24. fzPBD significantly outperformed all the other algorithms. fzPBD outperformed the other methods on both artificial and real datasets, particularly on the datasets with clusters that differed in size. fzPBD is, therefore, appropriate for real-world datasets, where the data densities are usually not uniformly distributed.

4.2.9 Fuzzy genetic subtractive clustering method (fzGASCE) [157]

We have addressed the key problems of the fuzzy C-Means algorithm: initialization of the fuzzy partition matrix, evaluation of fuzzy clustering results, missing value imputation and fuzzy partition defuzzification. The experimental results show that our methods outperformed the existing algorithms addressing the same issues. Yet another issue with the FCM algorithm is that, like most other partitioning approaches, it cannot by itself determine the number of clusters, and, as in previous studies, its results depend strongly on the initial parameters. For some initial values, FCM will converge rapidly to a global optimum, but, for others, it may become stuck in a local optimum. A common approach to addressing the limitations of FCM is to integrate FCM with the Genetic Algorithm (GA), where the GA is used to manage a set of solution candidates. The FCM algorithm is then applied and a cluster validity index is used as the GA fitness function to search for the best solution. Ghosh et al.
[163] and Liu et al. [166] proposed to use the partition coefficient (PC) [92] and the Xie and Beni (XB) [94] validity indices. In addition, Liu et al. [166] proposed a modified version of the PC index (MPC) in order to reduce the monotonic tendency of the index. The Fukuyama-Sugeno cluster index (FS) [93], which measures the compactness and separation of the cluster partition, was used by Ghosh et al. [163]. Lianjiang et al. [164], in a novel self-Adaptive Genetic Fuzzy C-Means algorithm (AGFCM), proposed a validity index combining the PC index with the total variation of the fuzzy partition. Haider, Pramanik and Kar [162] proposed a GA fitness function (HPK) based on the compactness measure and combined it with an intra-inter validity index for a novel algorithm that automatically determines the number of clusters in the dataset. Lin et al. [165] proposed a combination of GA and FCM for a novel method with an adaptive cluster validity index (ACVI) based on the intra and inter measures of the fuzzy partition, where the GA is used to solve the trade-off problem between these two factors for a better evaluation of the fuzzy partition. A common limitation of existing methods using GA with FCM is that the GA fitness functions are based on cluster validity indices, which usually have a problem with scaling between the compactness and separation factors. In addition, they use the maximum membership degree for defuzzification, which may be improper, as described in Section 4.2.8. In this section, we combine FCM with the GA and fuzzy SC algorithms for a novel clustering algorithm, fzGASCE, that automatically determines the number of clusters in the dataset. The FCM algorithm rapidly determines the exact clustering prototypes of each solution candidate so that the GA, managing a set of such candidates, can select the optimal one. The fuzzy SC method helps the GA escape any local optima [157].
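The division of labor described above (the GA manages candidate center sets while FCM rapidly refines each one) rests on the standard FCM update step. The following is a minimal sketch of that step in Python; the function name `fcm_step` and the list-of-tuples data layout are our own illustration, not the thesis implementation:

```python
import math

def fcm_step(X, V, m=1.25):
    """One FCM iteration: update the membership matrix U, then the centers V.
    X: list of data points (tuples); V: list of current cluster centers."""
    c, n = len(V), len(X)
    # Distances d[i][k] from point i to center k (floored to avoid division by zero).
    d = [[max(math.dist(x, v), 1e-12) for v in V] for x in X]
    # Membership update: u_ki = 1 / sum_j (d_ik / d_ij)^(2/(m-1)).
    p = 2.0 / (m - 1.0)
    U = [[1.0 / sum((d[i][k] / d[i][j]) ** p for j in range(c))
          for i in range(n)] for k in range(c)]
    # Center update: v_k = sum_i u_ki^m * x_i / sum_i u_ki^m.
    V_new = []
    for k in range(c):
        w = [U[k][i] ** m for i in range(n)]
        s = sum(w)
        V_new.append(tuple(sum(w[i] * x[t] for i, x in enumerate(X)) / s
                           for t in range(len(X[0]))))
    return U, V_new
```

In fzGASCE, each chromosome supplies the initial centers V; a few such iterations per candidate are enough for the fitness function to rank candidates.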
The fuzzy subtractive clustering (fzSC) method has advantages over the traditional one in that it is more efficient from the computational viewpoint. In addition, it is more appropriate for real-world problems, because it does not require additional information regarding the mountain peak, the mountain radii, or the threshold between the current density and the original density for a data object to be selected as a cluster candidate. However, fzSC uses a random fuzzy partition, and may perform worse in cases where the partition is not the result of a convergent fuzzy partitioning process. For an artificial dataset, the FCM algorithm can converge very rapidly, but for some real datasets, because of overlap regions where a data point can belong to two or more clusters, the FCM algorithm may not converge to a local optimum. The fuzzy strong uniform criteria may not hold for the new fuzzy partition created from the fuzzy partition by the FCM algorithm. Hence, the cluster candidates determined by our fzSC may not be optimal. In addition, a locally optimal fuzzy partition may cause the fzBLE method to improperly pick the best partition from a given set of fuzzy partitions.

Method design

Chromosome

We use a chromosome to represent the whole clustering solution. Each chromosome contains a set of loci, each standing for the index of a data point selected as a cluster center. We set the length of the chromosomes to √n, which is assumed to be the maximum number of clusters in the dataset.

Crossover operator

The crossover operator is used to produce two new offspring from a given pair of parents. Both the roulette wheel and tournament selection methods are used interchangeably to select parents, maintaining potentially useful solutions in the current generation. A two-point crossover operator with probability Pc = 0.5 is used.

Mutation operator

The mutation operator is used to make changes in portions of the chromosomes of newly created members.
Because each chromosome encodes data points representing the cluster centers of a clustering solution, changing the data points in the chromosome may improve the clustering quality. We, therefore, propose three different tasks for the mutation operator: (i) add a data point as a new cluster center, because this may help to locate a new cluster in a higher density region, (ii) remove a data point to prevent the inclusion of a sparse cluster, and (iii) replace one data point with another so that the new cluster is located in a higher density region. These tasks are commonly used by existing methods. However, they are employed in a random or heuristic way, and cannot guarantee that the GA algorithm will escape local optima. To address this issue, we applied our fzSC method. The fuzzy partitions of the parents are used to estimate the density at every data point using (4.13), (4.14), (4.15), (4.16) and (4.17). At each time, t, the data point x* with the highest density, M*, is selected, and the densities at the remaining data points are updated as in (4.19). The points where the densities change by less than a predefined ratio, RM, are considered to be significantly dense and are used in the mutation operator of fzGASCE, instead of randomly selected data points as in existing methods. We chose a value of 0.95 for RM. The selection of the value of RM does not affect the outcome; however, a low value of RM may slow the convergence process of fzGASCE.

Fitness function

Instead of using a cluster validity index for the fitness function, we use our cluster validity method, fzBLE, for cluster evaluation. For each chromosome, FCM is applied to generate the fuzzy partition, which is then used to generate a probabilistic model of the data distributions as in (4.25), (4.26) and (4.27).
The fitness function, given a fuzzy partition Θ = {U,V}, is defined as:

fitness(Θ) = log[L(Θ|X)] − log(c),    (4.52)

where L(Θ|X), the likelihood of the clustering model and the data, is measured as:

L(Θ|X) = L(U,V|X) = ∏_{i=1..n} p(x_i|U,V) = ∏_{i=1..n} Σ_{k=1..c} p(v_k) × p(x_i|v_k).    (4.53)

fzGASCE algorithm

Input: the data to cluster, X = {x_i}, i = 1,...,n.
Output: an optimal fuzzy clustering solution:
  c: the optimal number of clusters.
  V = {v_i}, i = 1,...,c: the cluster centers.
  U = {u_ki}, i = 1,...,n, k = 1,...,c: the partition matrix.

Steps
1) Randomly generate a set of chromosomes describing clustering solution candidates.
2) Compute the fitness value for every chromosome.
3) If the stop criteria are met, then go to Step 7.
4) Apply the crossover operator with the probability Pc = 0.5, and the roulette wheel and tournament parent selection methods.
5) Apply the mutation operator with the probability Pm = 0.01. The significantly dense data points are used in the replacement, and fresh offspring are created using these points with a probability of Pc × Pm.
6) Go to Step 2.
7) Select the 'best' chromosome from the population as the clustering solution.
8) Apply [157] for the defuzzification of the fuzzy partition of the solution.

Method testing

To evaluate the performance of fzGASCE, we used four artificial datasets. ASET2, ASET3 and ASET4 each have well-separated clusters of similar sizes; the numbers of clusters and data dimensions of these datasets are (5,2), (5,3) and (11,5), respectively. ASET5 is more complex, containing three clusters that differ in size and density (Figure 3-16). For real datasets, we used the Iris and Wine datasets.

Performance measures

We used three measures, COR, EVAR and MISC, to evaluate algorithm performance. COR is the correctness ratio, defined as

COR = (1/N) Σ_{t=1..N} I(ĉ_t − c),    (4.54)

where N is the number of trials, c and ĉ_t are the number of clusters and the predicted number of clusters in trial t, respectively, and I(.)
is defined as:

I(x) = 1 if x = 0; I(x) = 0 if x ≠ 0.

EVAR is a measure of the accuracy of the predicted number of clusters, defined as in (4.55):

EVAR = (1/N) Σ_{t=1..N} (c − ĉ_t)².    (4.55)

MISC is a measure of the overall performance, determined by the number of data objects that were misclassified, defined as in (3.21). To be fair, MISC is calculated only when an algorithm correctly identifies the number of clusters. Then, the assigned cluster label of each object is compared with its actual cluster label. If they do not match, a misclassification has occurred.

We compared the performance of fzGASCE with those of eight genetic algorithm methods that also use FCM, specifically, PBMF, MPC, HPK, AGFCM, XB, FS, PC and ACVI [162-166]. An earlier version of fzGASCE, fzGAE, which did not include the fuzzy SC and defuzzification methods, was also used. For each dataset, the number of clusters, c, was set to the known number of clusters. All algorithms were run using a population size of 24 and a maximum number of 100 generations. The fuzzy partition of each chromosome was generated using the FCM algorithm with 10 iterations, and the fuzzifier factor, m, was set to 1.25. We repeated the experiment 100 times and averaged the performance of each algorithm using the values of COR, EVAR and MISC.

ASET2 contains five clusters in a two-dimensional data space. The clusters are well-separated and of the same size. The performance of all algorithms is shown in Table 4.25. All algorithms had very low MISC measures, indicating that they grouped the data points into the correct clusters. However, fzGASCE outperformed all algorithms by all three measures, and fzGAE performed better than the other methods with the exception of fzGASCE. This comparison illustrates the advantage of using our fzBLE method [147] in the GA fitness function. The use of fuzzy SC [146] in fzGASCE also improved its performance, particularly in escaping local optima.
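The COR (4.54) and EVAR (4.55) measures reported in Tables 4.25 through 4.30 amount to a few lines of code. A sketch (the function name and the one-prediction-per-trial interface are our own):

```python
def cor_evar(true_c, predicted):
    """COR (4.54): the fraction of trials whose predicted cluster count
    equals the true count. EVAR (4.55): the mean squared prediction error.
    predicted: a list with one predicted cluster count per trial."""
    n = len(predicted)
    cor = sum(1 for c_hat in predicted if c_hat == true_c) / n
    evar = sum((true_c - c_hat) ** 2 for c_hat in predicted) / n
    return cor, evar
```

For example, a method run on ASET2 (c = 5) that predicts 5 clusters in half of the trials and misses by one in the rest gets COR = 0.5 and EVAR = 0.5.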
Table 4.25 : Performance of GA algorithms on ASET2

Algorithm   COR     EVAR    MISC
fzGASCE     1.000   0.000   0.000
fzGAE       0.640   0.500   0.000
PBMF        0.510   0.590   0.000
MPC         0.290   0.970   0.000
HPK         0.100   5.010   0.021
AGFCM       0.600   2.800   0.000
XB          0.490   1.450   0.000
FS          0.120   1.100   0.070
PC          0.230   1.040   0.000
ACVI        0.200   2.490   0.011

ASET3 contains five well-separated clusters in a three-dimensional data space, while ASET4 contains 11 clusters in a five-dimensional data space. The performance of all algorithms on ASET3 and ASET4 is shown in Tables 4.26 and 4.27, respectively. On both datasets, fzGASCE outperformed all other algorithms, while fzGAE was the second best.

Table 4.26 : Performance of GA algorithms on ASET3

Algorithm   COR     EVAR    MISC
fzGASCE     1.000   0.000   0.000
fzGAE       0.710   0.380   0.000
PBMF        0.600   0.450   0.000
MPC         0.610   0.860   0.000
HPK         0.120   5.240   0.000
AGFCM       0.650   1.490   0.000
XB          0.640   0.430   0.000
FS          0.520   0.840   0.011
PC          0.620   0.890   0.000
ACVI        0.100   2.100   0.000

Table 4.27 : Performance of GA algorithms on ASET4

Algorithm   COR     EVAR    MISC
fzGASCE     1.000   0.000   0.000
fzGAE       0.450   0.750   0.000
PBMF        0.340   1.000   0.000
MPC         0.420   0.820   0.002
HPK         0.010   1.910   0.037
AGFCM       0.340   2.380   0.000
XB          0.410   0.900   0.000
FS          0.400   0.880   0.000
PC          0.450   0.700   0.000
ACVI        0.170   4.650   0.000

The ASET5 dataset is non-uniform, with three clusters in a two-dimensional data space. Table 4.28 shows the algorithm performance. HPK and AGFCM both failed to determine the number of clusters. fzGASCE not only outperformed the other algorithms, with COR = 1.0, but also was the only algorithm that grouped all of the data points into the correct clusters. Although fzGAE performed better than the remaining algorithms, it failed to correctly group data points into clusters, similar to the other methods that used the maximum membership degree for defuzzification.
Table 4.28 : Performance of GA algorithms on ASET5

Algorithm   COR     EVAR    MISC
fzGASCE     1.000   0.000   0.000
fzGAE       0.900   0.100   0.107
PBMF        0.700   0.300   0.107
MPC         0.050   0.960   0.107
HPK         0.000   5.770   -
AGFCM       0.000   8.470   -
XB          0.040   0.960   0.107
FS          0.020   3.480   0.107
PC          0.050   0.960   0.107
ACVI        0.080   0.920   0.107

The IRIS dataset contains three clusters corresponding to the three classes of Iris flowers. The performance of the algorithms on this dataset is shown in Table 4.29. HPK and AGFCM again completely failed at detecting the number of clusters. fzGASCE outperformed the other algorithms in detecting the number of clusters as well as in grouping data points into their own clusters.

Table 4.29 : Performance of GA algorithms on IRIS

Algorithm   COR     EVAR    MISC
fzGASCE     1.000   0.000   0.033
fzGAE       0.880   0.120   0.040
PBMF        0.860   0.140   0.040
MPC         0.040   0.970   0.160
HPK         0.000   5.720   -
AGFCM       0.000   8.120   -
XB          0.050   1.010   0.040
FS          0.390   0.780   0.154
PC          0.080   0.920   0.115
ACVI        0.150   0.850   0.040

Table 4.30 : Performance of GA algorithms on WINE

Algorithm   COR     EVAR    MISC
fzGASCE     1.000   0.000   0.213
fzGAE       0.860   0.140   0.303
PBMF        0.000   2.050   -
MPC         0.000   2.810   -
HPK         0.000   6.760   -
AGFCM       0.000   9.210   -
XB          0.270   1.010   0.303
FS          0.000   5.720   -
PC          0.110   0.920   0.303
ACVI        0.090   0.910   0.303

The Wine dataset contains information on 13 attributes of the three classes of wines. Results on this dataset are shown in Table 4.30. Only fzGASCE and fzGAE identified the correct number of clusters, with COR values of 1 and 0.86, respectively. Among the other algorithms, only XB and ACVI detected the correct number of clusters, and only with low COR values. Overall, fzGASCE outperformed the other methods on both artificial and real datasets. It performed particularly well on datasets with clusters that differed in size, not only in predicting the correct number of clusters, but also in grouping the data points into the correct clusters.
fzGASCE is, therefore, appropriate for real-world problems, where the data densities are not uniformly distributed.

4.3 Motif finding problem

4.3.1 HIGEDA algorithm

We present a new algorithm, HIGEDA [145], applicable to either DNA or protein sequences, which uses the TCM model and combines the hierarchical gene-set genetics algorithm, HGA (Hong, 2008), with the EM and Dynamic Programming (DP) algorithms to find motifs with gaps. The HGA algorithm helps HIGEDA manage motif seeds and escape from local optima; the EM algorithm uses the best alignment of a motif model on the dataset, where the alignments are generated by DP to estimate the motif model parameters, so that the model fits the best conserved forms of the motifs of interest.

Method design

Given A = {a_1, a_2,..., a_l} ∪ {_}, the set of symbols used for sequence encoding plus the gap symbol, with l = 20 for protein sequences and l = 4 for DNA sequences, let S = {S_1, S_2,..., S_n} be a set of biological sequences over A. Assume L = {L_1, L_2,..., L_n}, where L_i is the length of sequence i, i = 1,...,n. Assume that the length of the shared motifs is a known value, W. With m_i = L_i − W + 1, the number of possible positions of a motif in sequence i, denote Z = {z_ij}, i = 1,...,n, j = 1,...,m_i, as the probabilities of motif occurrence at position j in sequence i. Let Θ be the motif model. The motif finding problem is to determine Θ and Z such that:

P(S,Z|Θ) → max.    (4.56)

Motif model

Of the motif model Θ, two components, θ_M and θ_B, model the motif and non-motif (the background) positions in the sequences. A motif is modeled by a sequence of discrete random variables whose values give the probabilities of each of the different letters occurring in each of the different positions of a motif occurrence. The background positions in the sequences are modeled by a single discrete random variable.
The motif model is as follows:

Θ = {θ_B, θ_M} = [θ_{a,k}], a ∈ A, k = 0, 1,..., W,    (4.57)

where column k = 0 (θ_B) holds the background probabilities and columns k = 1,...,W (θ_M) hold the motif position probabilities; that is, θ_{a,k}, 1 ≤ k ≤ W, is the probability that symbol a occurs at position k of a motif occurrence, and θ_{a,0} is the probability that a occurs at a background position, with

Σ_{a∈A} θ_{a,k} = 1, ∀k = 1,...,W.    (4.58)

The matrix in (4.57) is a position weight matrix (PWM), but its use differs from conventional PWMs in that it contains 21 rows for protein motifs and 5 rows for DNA motifs. The first row stands for the gap symbol, which may occur in a motif, but not in the first or last positions. The remaining rows stand for residue symbols: 20 amino acid letters or 4 nucleotide letters. W is chosen large enough to accommodate all possible motif consensus sequences. Figure 3-7 shows a motif model representing 'AC', 'AGC' and 'ATC'.

Sequence and motif model alignment with gaps

Given an input subsequence s, the best alignment with gaps of s and Θ is created using the DP algorithm. Because s comes from a real dataset, gaps are allowed only in Θ. The first symbol of s is aligned with the first column in Θ. Consecutive symbols from s are aligned with either gap or symbol columns to achieve the best alignment score. A conventional dynamic alignment score is the sum of the pairwise alignment scores. Instead, here, it is the product of all pairwise alignment scores, and hence it is the probability that the best consensus from Θ matches s. To control the occurrence of gaps, we define PPM, POG and PEG as the reward for a perfect match and the penalties for opening and extending a gap, respectively. We chose, by empirical experiments, PPM = 1, POG = 0.00875, and PEG = 0.325. Let U_jk be the best alignment score up to the j-th symbol s_j in s and column k of Θ. U_jk is calculated as in (4.59).
U_jk =
  U_{j−1,k−1} × PPM × θ_{s_j,k},                                              k = j,
  U_{j,k−1} × PG × (1 − θ_{s_{j+1},k+1}),                                     j = 0, k > j,
  max{U_{j−1,k−1} × PPM × θ_{s_j,k}, U_{j,k−1} × PG × (1 − θ_{s_{j+1},k+1})}, otherwise,    (4.59)

where PG is either PEG or POG, depending on the gap status at column k in U.

Let s = 'ACG' and Θ be as shown in Figure 4-25. Figure 4-26 shows the dynamic alignment matrix {U_jk} of s w.r.t. Θ. The best alignment score is 0.0039, which is found in the second row and the last column of the matrix. Hence, the best consensus for s = 'ACG' by Θ is 'AC'. 'AC' is considered the in-motif part, and the last symbol 'G' is the out-motif part of subsequence s. The motif model will be adjusted to fit 'AC' instead of 'ACG'. This feature produces the occurrence probabilities for gap symbols in our motif model even if such a symbol does not appear in the sequence set.

Figure 4-25 : A motif model

      k=0    k=1    k=2    k=3
-     0.00   0.00   0.25   0.00
A     0.02   0.50   0.10   0.04
G     0.02   0.30   0.35   0.02
C     0.03   0.10   0.01   0.90
T     0.03   0.10   0.29   0.04

Figure 4-26 : Dynamic alignment of s = 'ACG' w.r.t. Θ from Figure 4-25

      1       2        3
A     0.50    0.0043   -
C     -       0.0050   0.0039
G     -       -        0.0001

Motif occurrence model

Let S^ij be the subsequence of length W at position j in sequence i. Denote I(.) as the indicator function, and P_M(S^ij) and P_B(S^ij) as the conditional probabilities that S^ij is generated by the motif model and the background model, respectively:

P_M(S^ij) = P(S^ij | θ_M) = ∏_{k=1..W} ∏_{a=1..l} (θ_{a,k})^{I(S_{i,j+k−1} = a)},    (4.60)

P_B(S^ij) = P(S^ij | θ_B) = ∏_{k=1..W} ∏_{a=1..l} (θ_{a,0})^{I(S_{i,j+k−1} = a)}.    (4.61)

Let λ be the prior probability of motif occurrence at every possible position in the sequence set. Similarly to the TCM model in MEME [70], the motif occurrence probability at position j in sequence i is

z_ij = λ × P_M(S^ij) / (λ × P_M(S^ij) + (1 − λ) × P_B(S^ij)).    (4.62)

The motif log-odds score of S^ij is defined by:

los(S^ij) = log(P_M(S^ij) / P_B(S^ij)).    (4.63)

S^ij is considered a motif hit if it satisfies the inequality:

los(S^ij) > log[(1 − λ)/λ].    (4.64)
Gapped motif model

When gaps are allowed in the alignment of S^ij and Θ, P_M(S^ij) and P_B(S^ij) are calculated using the recovered version with gaps of S^ij, similar to that in Xie [178]. Let s^m and s^c be the in-motif and out-motif parts, respectively, of the best consensus for S^ij by Θ:

P_M(S^ij) = P_B(s^c) × P_M(s^m) = P(s^c | θ_B) × P(s^m | θ_M),    (4.65)

P_B(S^ij) = P_B(s^c) × P_B(s^m) = P(s^c | θ_B) × P(s^m | θ_B).    (4.66)

Multiple-motif finding method

To find multiple motifs in a given dataset, we use a mask variable M, M = {M_ij}, i = 1,...,n, j = 1,...,m_i, where M_ij represents the probability that a motif occurs again at position j in sequence i. M is initially set to {1}. The modified version of (4.62) with respect to M is

z^m_ij = P(z_ij = 1 | M) = z_ij × min_k{M_{i,j+k}}.    (4.67)

Once a motif is found, its positions are used to update M:

M_{i,j+k} = M_{i,j+k} × (1 − ΔM_{i,j+k}),    (4.68)

where ΔM_{i,j+k} = P(z_ij = 1) × (W − k)/W, k = 0,...,W−1. The update mechanism in (4.68) allows for multiple overlapping motifs.

Hierarchical gene-set genetics algorithm (HGA)

The GA is a global optimization procedure that performs adaptive searches to find solutions to large-scale optimization problems with multiple local optima. Conventional GAs use crossover and mutation operators to escape local optima. These operators depend strongly on how the probabilities of crossover and mutation are chosen. Recent improvements in GA have focused on adaptively adjusting the operator probabilities so that the genetic processes quickly escape local optima. However, setting up adaptive GAs is difficult and most approaches are based on heuristics. Here we use HGA, a GA improved by Hong [75]. HGA treats a chromosome as a set of gene-sets, not a set of genes as in conventional GAs, as a mechanism to escape local optima.
Starting with gene-sets of the largest size, equal to half the chromosome length, and ending with gene-sets of size 1, HGA performs crossover and mutation operations based on gene-set boundaries. When the model is (δ, k)-convergent, HGA expects to find a global optimum and attempts to escape local optima, if any exist, by performing genetic operations with the largest-size gene-sets. HGA is most appropriate for our genetics model, because each gene-set represents a set of adjacent columns in the motif model, which in turn represents patterns of adjacent residues in biological sequences. Genetic operations based on gene-set boundaries allow residues to come together as well as to change simultaneously.

EM algorithm of HIGEDA

HIGEDA solves the motif finding problem using an EM algorithm to maximize the likelihood function (4.69) over the entire dataset:

L(S,Z | Θ) = log P(S,Z | Θ) → max.    (4.69)

Estimation step: Z is estimated using (4.62) (or (4.67)):

z_ij^(t+1) = λ^(t) × P_M(S^ij) / (λ^(t) × P_M(S^ij) + (1 − λ^(t)) × P_B(S^ij)).    (4.70)

Maximization step: Θ and λ are computed with respect to (4.70):

L(S,Z|Θ) = Σ_{i=1..n} Σ_{j=1..m} [(1 − z_ij) log P_B(S^ij) + z_ij log P_M(S^ij) + (1 − z_ij) log(1 − λ) + z_ij log(λ)] → max.    (4.71)

Setting ∂L/∂λ = 0 gives

λ = (1/(n·m)) Σ_{i=1..n} Σ_{j=1..m} z_ij.    (4.72)

With respect to the constraints (4.58), the objective function is relaxed using Lagrange multipliers γ_k, k = 0,...,W:

L' = L(S,Z|Θ) + Σ_{k=0..W} γ_k (Σ_{a=1..l} θ_{a,k} − 1).

Using

∂log P_M(S^ij)/∂θ_{a,k} = (1/P_M(S^ij)) × ∂P_M(S^ij)/∂θ_{a,k} = I(S_{i,j+k−1} = a)/θ_{a,k},

setting ∂L'/∂θ_{a,k} = 0 yields

θ^M_{a,k} = −(1/γ_k) Σ_{i=1..n} Σ_{j=1..m} z_ij I(S_{i,j+k−1} = a).    (4.73)

Using (4.58) and (4.73),

γ_k = −Σ_{i=1..n} Σ_{j=1..m} z_ij.    (4.74)

Using (4.73) and (4.74),

θ^M_{a,k} = Σ_{i,j} z_ij I(S_{i,j+k−1} = a) / Σ_{i,j} z_ij.    (4.75)

Similarly,

θ^B_{a,0} = Σ_{i,j} (1 − z_ij) Σ_{k=1..W} I(S_{i,j+k−1} = a) / (W × Σ_{i,j} (1 − z_ij)).    (4.76)

Refining model parameters

The maximization step results in the model parameters Θ and λ. Because the values of Z are estimated during the EM process, a straightforward update to Θ and λ using (4.70), (4.75) and (4.76) may cause an oscillation in convergence. MEME estimates λ by trying values from 1/(m√n) to 1/(W+1). This requires significant computational time and squanders the benefit of the maximization step. We instead apply the gradient descent learning law to update each model parameter, ξ:

ξ(t+1) = ξ(t) + μ_t × Δξ,    (4.77)

where Δξ = ξ_e(t+1) − ξ(t), and ξ(t), ξ_e(t+1) and ξ(t+1) are the current, estimated and new values of ξ, respectively. The learning rate μ_t may be slightly reduced over processing time. A popular form of μ_t is as in (4.78), where T is the processing duration:

μ_t = μ_0 × (1 − t/T).    (4.78)

If the size of the dataset is small, some elements of Θ may be zero. These are not appropriate for the Bayesian process and remain zero. To solve this problem, MEME uses a Dirichlet mixture model [34]. While this has a strong mathematical basis, the drawback lies in how to select the right number of mixture components. In addition, this number has no meaning in sequence analysis. Instead, we use pseudo-count methods. For DNA, we borrow an added pseudo-counts method from Henikoff and Henikoff [74]. The added portion used is 0.01. For proteins, we propose a method using the motif model and the substitution probability matrix from BLOSUM62, and a heuristic that the pseudo-counts should be position specific and depend on strong signals [145]. The pseudo-count value for a given symbol a in column k of Θ is calculated using Equation (4.79):

psc_{a,k} = Σ_{b=1..l} θ_{b,k} × P_{a/b},    (4.79)

where P_{a/b} is the BLOSUM substitution probability for amino acid a given the observation of amino acid b. The motif model is then refined using (4.80), in which α and β are predefined, and α + β = 1:

θ'_{a,k} = α × θ_{a,k} + β × psc_{a,k}.    (4.80)
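The pseudo-count refinement of (4.79)-(4.80) blends each observed PWM column with substitution-derived counts. A minimal sketch under our reading of (4.79) (the dict-based layout, the `refine_column` name, and the tiny two-symbol example are illustrative assumptions, not the thesis code):

```python
def refine_column(theta_k, subst, alpha=0.7):
    """Blend one PWM column with BLOSUM-style pseudo-counts, per (4.79)-(4.80).
    theta_k: dict symbol -> probability for one motif column.
    subst:   dict (a, b) -> P(a|b), the substitution probability of a given b.
    alpha weights the observed column; beta = 1 - alpha weights the pseudo-counts."""
    beta = 1.0 - alpha
    # psc_{a,k} = sum_b theta_{b,k} * P(a|b)   (4.79)
    psc = {a: sum(theta_k[b] * subst[(a, b)] for b in theta_k)
           for a in theta_k}
    # theta'_{a,k} = alpha * theta_{a,k} + beta * psc_{a,k}   (4.80)
    return {a: alpha * theta_k[a] + beta * psc[a] for a in theta_k}
```

Because each column of `subst` sums to 1 over a, the refined column still sums to 1, so no renormalization is needed; zero entries of the observed column become small but nonzero, which is the point of the refinement.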
P SC a k =Z l P a/b> (4.79) b=l where P a /b is the BLOSUM substitution probability for amino acid a from the observation of amino acid b. The motif model is then refined using (4.80) in which, a and P are predefined, and a+P = 1 . ©'a,k = a.0 a ,k+p.psc ak . (4.80) 181 HIGEDA Algorithm HIGEDA uses HGA to manage its population and to escape local optima. During the evolution process, members of the current generation are processed with the EM algorithm a small number of times, and ranked based on their goodness, measured using a fitness function. The best members are used to create new ones using crossover and mutation operators. Newly created members replace the worst ones in the current generation using a tournament selection to form the new generation. This process repeats until the number of generations is equal to a predefined number, or the current generation goodness ratio is convergent. The best members from the last generation are taken as possible motif candidates. Each member contains a variable X and a chromosome encoding 0, described in (4.57); each gene in the chromosome represents a column in © that has 21 elements for a protein motif, or five elements for a DNA motif. There are W+l such genes in each chromosome. We note that while Bi [71] proposed encoding PWM using chromosomes, he did not discuss it in detail. Because the length W of a motif is small relative to the sequence lengths, our genetics model consumes less memory than those of Wei and Jensen [83] or Bi [71]. The gene-set maximum size is lo = W/2, the initial size is 1 and the final size is 1, such that, 1 = 2q, where 2q < 1 < 2q+l. (4.81) We use 8=0.05 and k=9 for (8, k)-convergent criterion. 182 The crossover operator is used to produce two new children from a given pair of parents. We use a two-point gene-set crossover operator with probability p c = 0.95. The mutation operator is used to make changes in some portions of the chromosomes of newly created members. 
Because each chromosome encodes a Θ resulting from an alignment on the whole sequence set, a one-position shift to the left or right may improve the quality of the alignment. We propose two different tasks for the mutation operator: (i) change in a gene-set: two rows are selected randomly, and the cells corresponding to the selected rows in the given gene-set are exchanged; and (ii) change to the whole chromosome: the gene-sets in the chromosome are shifted left or right by one position, and the blank gene-sets are filled with average probability values.

The fitness function of HIGEDA is a combination of the objective function (4.69) and the posterior scoring function (4.82). The objective function assesses the best-fit model, while the posterior scoring function determines how well the model is distinguished from the background in the entire dataset. We use the motif posterior scoring approach of Nowakowski-Tiuryn [79] and estimate the prior probability of motif occurrence. It follows that the posterior score of subsequence S^ij at position j in sequence i has the form:

S_p(S^ij | Θ, λ) = log[ λ × P_M(S^ij) / ((1 − λ) × P_B(S^ij)) ].    (4.82)

(4.69) and (4.82) are normalized and combined. Both are important to the model evaluation, but not equally in all contexts. The fitness function is first applied to find the best-fit models, and then used with the posterior scoring function to select the most significant model. To this rule, we add a scaling function v(t), shown in Figure 4-27. The fitness function of HIGEDA is then defined as:

fit(Θ) = (v/v_M) × L(S,Z|Θ) + ((v_M − v)/v_M) × Σ_{i=1..n} Σ_{j=1..m} [P(z_ij = 1)/|P(z_ij = 1)|] × S_p(S^ij | Θ, λ).    (4.83)

Figure 4-27 : v(t) = v_M × T / (T + t²), where v_M is the maximum value of v.

To find gapped motifs, in later phases, HIGEDA tries to improve the most significant model by introducing gaps. This is similar to a local alignment problem, but instead of aligning every sequence to the sequence set, HIGEDA, using DP, aligns each sequence to the motif model that describes the local alignment of the sequence set. While the run time of a gapless alignment is O(W), that of a DP alignment is O(W²). By restricting the use of DP, the speed of HIGEDA is significantly improved.
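The scores that enter the fitness function follow directly from (4.60), (4.61) and (4.82). In the sketch below, the PWM is a dict mapping each symbol to a list holding its column-0 (background) probability followed by its motif-column probabilities; this layout and the function names are our own illustration, not HIGEDA's data structures:

```python
import math

def motif_prob(pwm, s):
    """P_M(S^ij) per (4.60): the product of theta_{a,k} over motif columns k = 1..W."""
    p = 1.0
    for k, a in enumerate(s, start=1):
        p *= pwm[a][k]
    return p

def background_prob(pwm, s):
    """P_B(S^ij) per (4.61): the product of the background column theta_{a,0}."""
    p = 1.0
    for a in s:
        p *= pwm[a][0]
    return p

def posterior_score(pwm, s, lam):
    """S_p(S^ij | Theta, lambda) = log( lam*P_M / ((1-lam)*P_B) ), per (4.82)."""
    return math.log(lam * motif_prob(pwm, s)
                    / ((1.0 - lam) * background_prob(pwm, s)))
```

A positive score means the subsequence is better explained by the motif model than by the background at the given prior, which is exactly the hit condition (4.64) rewritten in posterior form.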
This is similar to a local alignment problem, but instead of aligning every sequence to the sequence set, HIGEDA, using DP, aligns the sequence to the motif model that describes the local alignment of the sequence set. While the run time of gapless alignments is O(W), that of DP alignment is 0(W 2 ). By restricting the use of DP, the speed of HIGEDA is significantly improved. 184 HIGEDA Algorithm [145] • Input: Sequence set S, motif length W. • Output: Motif model 0. Steps 1) Set M={1} once, set gene-set parameters using (4.71), and randomly generate the first generation, set gNo=0. 2) For each member in the current generation, apply EM once. Measure member goodness using the fitness function (4.73). 3) Select parents from the best members using ranking selection; apply genetics operator to create new members. 4) Create new generation using tournament selection, gNo++. 5) If the current generation is (8, k)-convergent, then 6) save ( 1 ), set 1=10, 7) Apply genetics operators to refresh current generation. 8) Restore ( 1 ), go to Step 2. 9) If stop criteria are met, then 10) set 1 = 1/2, set gNo = 0. 11) Ifl>0,thengotoStep2. 12) Take the best member as the motif candidate; run EM with the candidate until convergent to obtain the motif model. 13) Output the motif model and update M using (4.68). 14) Stop, or go to Step 1 to find more motifs. Method testing We compared HIGEDA with five open source algorithms, GAME-2006, MEME v3.5, GLAM2-2008, BioProspector-2004 and PRATT v2.1, one web application, GEMFA (http://gemfa.cmh.edu), and for the EM motif algorithm, we used results from 185 Bi [71]. All algorithms were run 20 times on each dataset with the runtimes recorded; for GEMFA, this was obtained from the website. Default run parameters were used for all programs. For HIGEDA, the number of generations at each gene-set level was 50 and the population size for all gene-set levels was 60. 
For the algorithms in Tables 4.31, 4.32 and 4.33, motif lengths were set to the known motif lengths. For Tables 4.34 and 4.35, we first ran HIGEDA to find statistically significant motifs, and then ran the other algorithms using the lengths of the best motifs found by HIGEDA.

Performance measures
As in Bi [71], two quantities are used to evaluate the algorithms: LPC, the letter-level performance coefficient, and SPC, the motif site-level performance coefficient. If we denote delta(.) as the indicator function, and O_i and A_i, respectively, as the sets of known and predicted motif positions in sequence i, then:

LPC(S) = (1/n) Σ_{i=1..n} |A_i ∩ O_i| / |A_i ∪ O_i|    (4.84)

SPC(S) = (1/n) Σ_{i=1..n} delta(A_i ∩ O_i ≠ ∅)    (4.85)

Simulated DNA datasets
We generated simulated DNA datasets as in Bi [71], with three different background base compositions: (a) uniform, where A, T, C, G occur with equal frequency; (b) AT-rich (AT = 60%); and (c) CG-rich (CG = 60%). The motif string, GTCACGCCGATATTG, was merged once (or twice, if needed) into each sequence, after a defined level of change to the string: (i) 9% change, representing limited divergence (i.e., 91% of symbols are identical to the original string); (ii) 21% change; or (iii) 30% change (essentially background or random sequence variation).

Table 4.31: Average performance (LPC/SPC) on simulated DNA datasets (columns: Uniform, AT-Rich, CG-Rich)

91% motif identity:
HIGEDA 1.00/1.00 1.00/1.00 1.00/1.00
GEMFA 0.98/1.00 0.98/1.00 1.00/1.00
GAME 0.86/0.88 0.88/0.90 0.91/0.94
MEME 1.00/1.00 1.00/1.00 1.00/1.00
GLAM2 1.00/1.00 1.00/1.00 1.00/1.00
BioPro. 0.99/1.00 0.99/1.00 0.94/0.95
PRATT 0.83/0.95 0.88/1.00 0.46/0.80
EM 0.99/1.00 0.99/1.00 1.00/1.00
Average runtimes: 32s, 22s, 2m26s, 2s, 52s, 2s, 0.2s

79% motif identity:
HIGEDA 0.87/0.96 1.00/1.00 1.00/1.00
GEMFA 0.87/0.88 0.87/0.90 0.85/0.89
GAME 0.43/0.55 0.55/0.61 0.64/0.71
MEME 0.95/0.95 1.00/1.00 1.00/1.00
GLAM2 1.00/1.00 0.07/0.25 1.00/1.00
BioPro.
0.86/0.92 0.89/0.95 0.94/0.98
PRATT 0.46/0.70 0.03/0.15 0.09/0.15
EM 0.83/0.87 0.89/0.91 0.87/0.89
Average runtimes: 32s, 18s, 2m11s, 2s, 57s, 2s, 0.2s

70% motif identity:
HIGEDA 0.79/0.81 0.84/0.89 0.65/0.76
GEMFA 0.56/0.65 0.50/0.60 0.52/0.56
GAME 0.14/0.28 0.19/0.34 0.20/0.34
MEME 0.44/0.50 0.75/0.75 0.27/0.30
GLAM2 0.95/0.95 0.01/0.05 0.00/0.05
BioPro. 0.26/0.33 0.39/0.44 0.25/0.33
PRATT 0.31/0.40 0.05/0.10 0.19/0.30
EM 0.38/0.48 0.47/0.58 0.48/0.54
Average runtimes: 32s, 20s, 1m55s, 2s, 1m20s, 2s, 0.2s

In Table 4.31, we compare results with those obtained with seven other algorithms. When motif sequences are 91% identical, all algorithms performed equally well regardless of base composition. However, when the identity dropped to 79% or 70%, HIGEDA in general performed as well as or significantly better than the other algorithms on all base compositions. Hence, HIGEDA is not significantly affected by noise.

Finding motifs in biological DNA sequences

Table 4.32: Average performance (LPC/SPC/run time) on eight DNA datasets (# of sequences / length of motif / # of motif occurrences)

Algo. E2F(27/11/25) ERE(25/13/25) crp(24/22/18) arcA(13/15/13)
HIGEDA 0.57/0.96/33s 0.58/0.87/37s 0.75/0.90/22s 0.38/0.53/56s
GEMFA 0.64/0.85/39s 0.74/0.92/43s 0.57/0.88/12s 0.32/0.42/26s
GAME 0.24/0.90/2m45s 0.24/0.75/2m11s 0.45/0.80/2m33s 0.05/0.10/1m26s
MEME 0.71/0.85/16s 0.68/0.68/4s 0.55/0.68/1s 0.47/0.54/4s
GLAM2 0.84/0.93/1m28s 0.74/0.92/1m10s 0.54/0.64/1m13s 0.47/0.54/1m17s
PRATT 0.17/0.33/0.2s 0.24/0.44/0.2s 0.22/0.56/0.1s 0.17/0.23/0.4s
BioPros. 0.50/0.63/3s 0.65/0.71/3s 0.37/0.45/1s 0.01/0.01/6s

Algo. argR(17/18/17) purR(20/26/20) tyrR(17/22/17) ihf(17/48/24)
HIGEDA 0.75/0.94/1m22s 0.90/0.93/1m38s 0.34/0.41/1m19s 0.14/0.31/3m44s
GEMFA 0.35/0.38/30s 0.81/0.90/39s 0.32/0.49/27s 0.11/0.26/1m
GAME 0.36/0.55/2m31s 0.33/0.53/3m23s 0.11/0.23/2m39s 0.07/0.18/10m7s
MEME 0.75/0.94/5s 0.58/0.85/4s 0.43/0.47/3s 0.00/0.00/5s
GLAM2 0.38/0.47/1m32s 0.17/0.50/3m11s 0.35/0.35/1m55s 0.12/0.33/9m55s
PRATT 0.16/0.29/0.2s 0.46/0.95/0.2s 0.06/0.12/0.3s 0.06/0.25/0.5s
BioPros.
0.62/0.69/8s 0.18/0.20/7s 0.33/0.42/6s 0.05/0.13/17s

We used eight DNA transcription factor binding site datasets: two eukaryotic datasets, ERE and E2F, and six bacterial datasets, CRP, ArcA, ArgR, PurR, TyrR and IHF. Table 4.32 shows that HIGEDA performed as well as or better than the other algorithms on most datasets. One exception is ERE, possibly because it is an example of OOPS, which is the motif-finding model used by GEMFA. Also, GLAM2 performed better than HIGEDA on IHF, which contains a motif of length 48.

Finding motifs in biological protein sequences
We selected twelve protein families from PFAM and Prosite representing motifs with different levels of sequence specificity, from completely defined to more subtle, degenerate and gapped.

Figure 4-28: Strep-H-triad motif found by HIGEDA (sequence logo).

Results obtained with HIGEDA are compared to those obtained with the other algorithms (Table 4.33). The gap parameters of all algorithms, where supported, are set to the known gap structures. All algorithms identify the unambiguous DUF356 IHPPAH motif. For the Strep-H-triad motif, the HIGEDA, MEME and GLAM2 consensus sequences are more specific than the known motif, because of the motif decoding procedure. Full degeneracy is shown in Figure 4-28.

Table 4.33: Detection of protein motifs (1-8, PFAM; 9-12, Prosite). Each entry lists the family (# of sequences, max length), the known motif, and the motif predicted by each algorithm, with run times.

1. DUF356 (20, 158). Known motif: IHPPAH.
HIGEDA (12s): IHPPAH; MEME (2s): IHPPAH; GLAM2 (1m58s): IHPPAH; PRATT (0.1s): IHPPxH.

2. Strep-H-triad (21, 486). Known motif: HxxHxH.
HIGEDA (40s): HGDHYH; MEME (5s): HGDHYH; GLAM2 (1m51s): HGDHYH; PRATT (0.1s): HxxHxH.

3. Excalibur (26, 296). Known motif: DxDxDGxxCE.
HIGEDA (1m): DRD[RNGK]DG[IV][AG]CE; MEME (4s): [WY]Q[GA][NW]YYLKSD; GLAM2 (2m56s): DRDKDGVACE; PRATT (0.5s): DxDxxxC.

4. Nup-retrotrp (14, 1475). Known motif: GRKIxxxxxRRKx.
HIGEDA (3m18s): SGRKIKTAVRRKK; MEME (10s): W[DE]C[DE][TV]C[LC][VL]QNK[AP][ED]; GLAM2 (6m55s): SNGKNMFSSSGTS; PRATT (1s): FSSS[GP]TxxSx(1,2)RK.

5.
Flagellin-C (42, 1074). Known motif: NRFxSxIxxL.
HIGEDA (1m29s): RA[NDQG]LGA[FV]QNR; MEME (6s): R[AS][DNQ]LGA[VF]QNR; GLAM2 (3m35s): RADLGAFQNR; PRATT (0.07s): A-x-Q.

6. Xin Repeat (25, 3785). Known motif: GDV[KQR][TSG]x[RKT]WLFETxPLD.
HIGEDA (5m45s): GDV[RK][ST][ACT][RK]WLFETQPLD; MEME (1m1s): KGDV[RK]T[CA][RK]W[LM]FETQPL; GLAM2 (6m36s): HKGDVRTCRWLFETQP; PRATT (6s): GDVxTxxWxFETxP.

7. Planctomycete Cyt. C (12, 1131). Known motif: C{CPWHF}{CPWR}CH{CFYW}.
HIGEDA (1m): [NHKSY]C[AQELMFTV][AGS]CH; MEME (9s): FSPDGK; GLAM2 (3m42s): RFSPDG; PRATT (0.02s): PDxxxL.

8. zf-C2H2 (115, 99). Known motif: Cx(1-5)Cx3#x5#x2Hx(3-6)[HC].
HIGEDA (2m12s): C[_Y][LRQEK][_C][LEKP][_EGIK]C[_G]K[ARST]FS[RQK][KS]S[NHSY]L[NKT][RKST]H[QILKM]R[ISTV]H; MEME (45s): EIC[NG]KGFQRDQNLQLHRRGHNLPW; GLAM2 (19m17s): YKC_P_CGK_FS_KSSLT_H_RI_HT; PRATT (0.1s): Hx(3-5)H.

9. EGF_2 (16, 3235). Known motif: CxCx2[GP][FYW]x(4-8)C.
HIGEDA (4m19s): C[LEK][_C][_EILSV]C[NDE][NDQEPS]G[FWY][AQESTY]G[DS]DCS[GI]; MEME (15s): GECxCNxGYxGSDCSI; GLAM2 (8m): CNx(0-9)GECxCNEGWSGDDC; PRATT (1s): Gx(0-1)CxCx5Gx2C.

10. LIG-Cyclin-1 (23, 1684). Known motif: [RK]xLx(0-1)[FYLIVMP].
HIGEDA (1m8s): [RK]R[RL][RIL][DEIFY]; MEME (18s): PAPAP; GLAM2 (2m35s): Tx(0-1)RKP; PRATT (0.5s): Kx(0-1)R.

11. LIG_PCNA (13, 1616). Known motif: x(0-3)x{FHWY}[ILM]{P}{FHILVWYP}[DHFM][FMY].
HIGEDA (1m32s): [RGIKP][LQK][_RDKS][DPST][IL][SV]FFG; MEME (9s): GQKTIMSFFS; GLAM2 (4m59s): Qx(0-1)SIDSFF[K_][R_]; PRATT (0.3s): Lx2Sx(2-4)K[KMTY].

12. Mod_Tyr_Itam (10, 317). Known motif: [DE]x(2)Yx(2)[LI]x(6-12)Yx(2)[LI].
HIGEDA (51s): [RDQE][_S][_CET][_IL][_GM][_QTY][LQELK][_DGK][EI][_RS][_RN][_RGP]LQ[DGV][GHT]YxM[CIY]Q[NGT]L[ILS]; MEME (2s): GKEDDGLYEGLNIDDCATYEDIHM; GLAM2 (3m43s): E_H_SLAQKSMDH_SRQ; PRATT (0.1s): Lx2Lx(0-1)L.

HIGEDA and GLAM2 identified the exact motif for the moderately degenerate Excalibur (Section 2.5.6). Only HIGEDA identified the Nup-retrotrp motif, erring in only one residue and identifying the key arginine (R) residues.
No algorithm identified the Flagellin-C motif, possibly because of the weak motif signal. While all performed well in finding the Xin repeat motif, only HIGEDA identified the Planctomycete Cytochrome C motif, erring in only one residue. For the gapped-motif zf-C2H2 family, only HIGEDA identified the key C, H, F and L residues and the spacing between them. HIGEDA also provided better results on EGF_2 and LIG-Cyclin-1. No algorithm identified the LIG_PCNA and Mod_Tyr_Itam motifs. Together, these results show that HIGEDA can effectively find more subtle motifs, including those with gaps. Because HIGEDA uses a GA to find motif seeds, its run times are longer than those of MEME and PRATT, but shorter than those of GLAM2, as indicated in Table 4.33.

4.3.2 New motif discovery using HIGEDA
To test HIGEDA in the prediction of novel motifs, we selected the following proteins from PFAM: those containing only a ZZ domain, only a SWIRM domain, a ZZ plus a Myb domain, and a ZZ plus both a Myb and a SWIRM domain. While ZZ domains have a consensus sequence in PFAM, SWIRM and Myb proteins are defined only by experimental demonstration of functional properties and protein sequence alignments. We ran HIGEDA on these groups, varying the output motif length from 4 to 33. We defined statistically significant motifs as those found in more than 85% of sequences and having a p-value less than 0.0001 (computed as in Touzet and Varre [82]). We also ran MEME, GLAM2 and PRATT. In Table 4.34, we first show that HIGEDA, MEME and PRATT successfully identified patterns of the known ZZ motif, Cx(1-5)C. No motif was discovered in the SWIRM-containing proteins, but HIGEDA, MEME and GLAM2 each discovered a motif, WxAxEELLLL, common to Myb proteins, which we propose as a novel domain. Uniquely discovered by HIGEDA, we propose a second novel domain, GNW[AQ]DIADH[IV]G[NGST], of unknown function, which is specific to the ZZ-SWIRM-Myb protein set.
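The 85%-of-sequences criterion above can be checked mechanically once a pattern is in hand. Below is a small sketch that converts a restricted PROSITE-style pattern such as the ZZ consensus C-x(1,5)-C into a regular expression and computes the fraction of sequences hit. The converter handles only the constructs shown and is purely illustrative; it is not the thesis's significance computation, which additionally requires the Touzet-Varre p-value.

```python
import re

def prosite_to_regex(pattern):
    """Convert a small subset of PROSITE-style syntax (e.g. 'C-x(1,5)-C')
    to a Python regex. Literal residues and [..] classes pass through."""
    out = []
    for token in pattern.split('-'):
        m = re.fullmatch(r'x\((\d+)(?:,(\d+))?\)', token)
        if m:
            lo, hi = m.group(1), m.group(2) or m.group(1)
            out.append('.{%s,%s}' % (lo, hi))   # x(n) / x(n,m) -> bounded wildcard
        elif token == 'x':
            out.append('.')                      # single wildcard position
        else:
            out.append(token)
    return ''.join(out)

def hit_fraction(pattern, sequences):
    """Fraction of sequences containing at least one match of the motif."""
    rx = re.compile(prosite_to_regex(pattern))
    return sum(bool(rx.search(s)) for s in sequences) / len(sequences)
```

A motif would then qualify as statistically significant here when `hit_fraction(...) > 0.85` and its p-value passes the 0.0001 cutoff.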
Table 4.34: Motifs of ZZ, Myb and SWIRM domains found by the four algorithms (family (# of sequences, max length); each line gives algorithm, found motif, p-value where computed, and % hit)

ZZ (67, 5125):
HIGEDA: D[FY]DLC[AQES]xC-[EYV], 5.76e-7, 91.67
HIGEDA: C[GHPY]D[FY]DLC[AQES]xC, 2.25e-6, 93.06
MEME: CLICPDYDLC, 73.00
GLAM2: HSRDHP[ML][IL][QRK], 91.00
PRATT: Cx2C, 92.00

SWIRM (49, 624): no motif found.

ZZ-Myb (35, 2697):
HIGEDA: [ND]W[ST]A[DE]EELLLL, 3.81e-9, 97.14
MEME: WTADEELLLLD, 100.0
GLAM2: WTAEEELLLLE, 97.00
PRATT: Ex2Lx(4-5)E, 96.00

ZZ-Myb-SWIRM (43, 1049):
HIGEDA: GNW[AQ]DIADH[IV]G[NGST], 3.6e-10, 100.0
HIGEDA: WGADEELLLLEG, 2.4e-10, 100.0
MEME: WGADEELLLLEG, 100.0
GLAM2: KQLCNTLRILPK, 85.00
PRATT: Wx(1-2)D[ADEQ]Ex(2-3)L[ILV], 100.0

Lastly, we selected the protein families Presenilin-1 and Signal peptide peptidase, which share the conserved sequence GXGD and are grouped in the same clan in PFAM, as shown in Table 4.35. Only HIGEDA successfully discovered motifs containing GLGD in these families. In addition, two patterns, YDIFWVF, which is also identified by MEME and GLAM2, and FGT[NDP]VMVTVA[KT], which is also identified by MEME, appear frequently (in 86% of the 151 protein sequences) in the Signal peptide peptidase family. We therefore propose these patterns as new motifs of this family.

Table 4.35: Motifs of Presenilin-1 and Signal peptide peptidase (family (# of sequences, max length); each line gives algorithm, found motif, p-value where computed, and % hit)

Presenilin-1 (86, 622):
HIGEDA: GLGDFIFYS, 1.3e-10, 85.00
HIGEDA: KLGLGDFIFY, 1.6e-10, 85.00
MEME: HWKGPLRLQQ, 66.00
GLAM2: GDFIFYSLVL, 85.00
PRATT: Lx(1-3)Lx(2-3)I, 100.0

Signal peptide peptidase (151, 690):
HIGEDA: F[AS]MLGLGDIVIPG, 8.9e-08, 88.74
HIGEDA: YDIFWVF[GF]T[NDP], 1.7e-08, 86.75
HIGEDA: GLF[IFV]YDIFWVF, 8.7e-08, 86.75
HIGEDA: FGT[NDP]VMVTVA[KT], 8.6e-08, 86.09
MEME: FWVFGT[ND]VMV, 86.00
MEME: YDIFWVFGT[NDP]VMV, 87.00
MEME: [AS]MLGLGDIVIPGI, 92.00
GLAM2: [FL][FI]YD[IV]F[WF]VF[GF]GL, 95.00
GLAM2: FFYDIFWVFGT, 90.00
PRATT: Lx3F, 100.0

HIGEDA integrates HGA, EM and dynamic programming (DP) to find motifs in biological sequences.
HIGEDA uses the HGA to manage different motif seeds and to escape local optima more efficiently than conventional GAs by using gene-set levels. By applying DP, proposing a new technique for aligning sequences to a motif model, and then using EM with the dynamic alignments, HIGEDA provides a new way to automatically insert gaps in motif models. Using the gradient-descent learning law, HIGEDA effectively estimates its model parameters without the greedy searches used in MEME. Using a pseudo-counts method based on a simple mechanism and the BLOSUM62 substitution probability matrix, HIGEDA avoids the small-dataset problem of zeros in motif model elements, without using the Dirichlet mixture prior knowledge used in MEME and similar approaches. Because HIGEDA uses a set of randomly generated motif seeds, it outperforms MEME, GLAM2 and several other algorithms in finding subtle, more degenerate and gapped motifs. Lastly, we have shown that, as a TCM-based algorithm, HIGEDA can identify novel motifs.

5. Applications

5.1 Recovery of a gene expression signature
In this application, we apply our methods to recover the estrogen gene expression signature from the CMAP dataset. The flowchart is shown in Figure 5-1.

Figure 5-1: Gene expression signature prediction. Integrated gene expression data are clustered by fzGASCE (a genetic algorithm with fzSC and fzBLE); fzSC (fuzzy subtractive clustering) generates cluster candidates and fuzzy partitions, with generation selection over potential offspring and immunized candidates producing the cluster solution.

The fzGASCE, fzSC, fzBLE, fzPBD and RPPR methods were applied to the estrogen treatment datasets in CMAP with the estrogen gene expression signature [85]. We first preprocessed the CMAP arrays using the RMA method to obtain 564 expression profiles. Using our meta-analysis method, RPPR [169], we selected the expression profiles that were most strongly related to the estrogen expression signature [25, 26].
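To make the profile-selection step concrete, here is a toy rank-based score relating one expression profile to an up/down signature. This is a stand-in illustration only: the actual selection used the RPPR method [169], whose scoring differs; the gene indexing and the [-1, 1] scaling are assumptions made for the sketch.

```python
import numpy as np

def signature_score(expr, up_genes, down_genes):
    """Toy enrichment-style score: mean normalized rank of the up-genes
    minus that of the down-genes, in [-1, 1]."""
    ranks = np.argsort(np.argsort(expr))                 # 0 = lowest expression
    n = len(expr) - 1
    up = np.mean([ranks[g] for g in up_genes]) / n       # high when up-genes rank near the top
    down = np.mean([ranks[g] for g in down_genes]) / n   # low when down-genes rank near the bottom
    return up - down                                     # near +1: signature strongly recovered
```

A profile scoring near +1 behaves like the signature; one near -1 behaves like its inverse (as fulvestrant does for the estrogen signature later in this section).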
Expression profiles from five experiments, #373, 988, 1021, 1079 and 1113, were selected. We then used the Rank-Product method [30, 54] across the multiple experiments to filter differentially expressed probes using a p-value cutoff of 0.005. This process selected 204 probes. The expression levels of these 204 probes across the selected profiles were extracted, and the fzGASCE clustering method was applied using the Pearson correlation distance. Four clusters were found. Each cluster, a set of probesets, is considered a candidate estrogen gene expression signature. We applied RPPR to each cluster using CMAP. Significantly related expression profiles, with cutoffs of 0.5 and -0.5 for positive and negative scores respectively, are shown for each cluster in Table 5.1.

Table 5.1: Clustering results with drugs significantly associated with estrogen. Each cluster (# probesets) lists associated drugs as drug[CMAP experiment](+/- hit score).

Cluster 1 (60): estradiol[373](+0.93), estradiol[1021](+0.85), estradiol[988](+0.733), estradiol[1079](+0.69), doxycycline[1113](+0.64), genistein[1073](+0.63), estradiol[365](+0.63), nordihydroguaiaretic acid[415](+0.59), alpha-estradiol[1048](+0.58), genistein[268](+0.57), estradiol[414](+0.57), genistein[1015](+0.56), dexamethasone[374](+0.54), estradiol[121](+0.53), alpha-estradiol[990](+0.52), genistein[267](+0.52), fulvestrant[1076](-0.75), fulvestrant[1043](-0.71), fulvestrant[985](-0.66), fulvestrant[367](-0.57), 5224221[956](-0.55), 15-delta prostaglandin J2[1011](-0.54), 5230742[862](-0.54), fulvestrant[310](-0.52)

Cluster 2 (53): doxycycline[1113](+0.85), estradiol[988](+0.74), estradiol[1021](+0.61), estradiol[373](+0.58), 12,13-EODE[1108](+0.56)

Cluster 3 (44): estradiol[988](+0.84), estradiol[1079](+0.80), estradiol[1021](+0.58), alpha-estradiol[990](+0.52)

Cluster 4 (47): estradiol[1021](+0.92), estradiol[988](+0.66), doxycycline[1113](+0.61), estradiol[1079](+0.58)

Table 5.2: Estrogen signature and the 'signature' predicted by fzGASCE. Columns: Affy-id; expression in the selected profiles
(373, 988, 1021, 1079, 1113); predicted ('X' = probeset recovered in the predicted signature, '-' = not recovered).

202437_s_at 1.72 1.92 1.97 1.68 1.75 X
206115_at 1.29 1.29 1.28 1.40 1.54 X
209339_at 1.66 1.80 1.77 1.76 1.82 X
209687_at 1.51 1.66 1.59 1.63 1.65 X
210367_s_at 1.62 1.77 1.64 1.70 1.71 X
210735_s_at 1.71 1.73 1.71 1.81 1.81 X
211421_s_at 1.35 1.55 1.47 1.63 1.59 X
213906_at 1.29 1.48 1.45 1.43 1.62 X
215771_x_at 1.45 1.69 1.71 1.79 1.63 X
215867_x_at 1.69 1.82 1.77 1.70 1.82 X
205879_x_at 1.50 1.81 1.68 1.68 1.95 X
205862_at 1.69 1.81 1.69 1.69 1.72 X
203666_at 1.55 1.61 1.60 1.60 1.68 X
203963_at 1.71 1.88 1.71 1.85 1.81 X
204508_s_at 1.60 1.82 1.76 1.93 1.91 X
204595_s_at 1.78 1.80 1.62 1.75 1.99 X
205380_at 1.31 1.59 1.50 1.64 1.84 X
205590_at 1.69 1.71 1.87 1.82 1.79 X
217764_s_at 1.62 1.83 1.74 1.85 1.84 X
204597_x_at 1.70 1.74 1.75 2.02 1.69 -
205239_at 1.71 1.73 1.77 1.62 1.81 -
204497_at 1.79 1.76 1.70 1.79 1.84 -

Table 5.1 shows that cluster 1 strongly hits the estrogen expression profiles in CMAP. It is also the only cluster with a significant negative correlation with fulvestrant, an estrogen antagonist. This cluster is therefore the best candidate for the estrogen gene expression signature. Interestingly, this cluster contains 19 of the 22 probesets from the actual estrogen signature that were significantly regulated in the five selected profiles (Table 5.2). We therefore conclude that fzGASCE successfully discovered the estrogen gene expression signature from the CMAP datasets.

5.2 Gene-drug association prediction
Our algorithmic methods can be combined into a solution model for the problem of gene-drug association prediction. Given a set of genes that may describe, completely or partially, the biological phenotype of interest, the problem is to search the CMAP datasets for drug treatments that directly or indirectly alter their expression.

5.2.1 Model design
Let A be the given set of genes.
Figure 5-2: Gene-drug association prediction. The given gene set A is expanded to a set B, the key gene set B_A is identified, a gene expression signature is generated, and gene-drug associations are predicted.

Gene set expansion

Figure 5-3: Expansion of gene set A. Given A, a set of genes describing the biological phenotype, in-vivo interactors of A are collected, KEGG pathway annotation is applied, and the genes having at least one KEGG-annotated interaction partner are selected, forming the expanded set B with A ⊆ B.

Because our knowledge of the molecular basis of the phenotype is limited, the set A probably does not describe the phenotype completely. To address this problem, we propose to expand the gene set using protein-protein interaction (PPI) and pathway data. A gene g is added to set A if g interacts with at least one gene in set A, and g and/or its interaction partner in A is annotated as a component of some pathway. These criteria are chosen because, while physically interacting proteins are not necessarily components of the same biological process, a pair of interacting proteins that is associated with one pathway is more likely to participate in another pathway. We first use PPI data from the Human Protein Reference Database (HPRD) to expand set A. We then use KEGG pathway data to select the proteins that meet the criteria. If a gene g, g ∈ A, is annotated as a component of any pathway, we include the genes that interact with g, whether or not they are annotated as pathway components. If gene g is not a component of any pathway, its interaction partners are included only if they are annotated as components of at least one pathway; these included genes then suggest a potential novel role for g. Let B be the expanded gene set. B contains the set A and the proteins that interact with A, and may mediate the association of proteins in A with some pathways. Every pair of interacting genes in B is annotated in at least one pathway.
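The expansion rule just described can be sketched in a few lines of Python; `ppi` and `in_pathway` are hypothetical stand-ins for the HPRD interaction map and a KEGG pathway-membership lookup.

```python
def expand_gene_set(A, ppi, in_pathway):
    """Add interactor g of a gene a in A when g or a is annotated to at
    least one pathway.

    A          -- set of query genes
    ppi        -- dict mapping gene -> set of interaction partners
    in_pathway -- dict mapping gene -> True if pathway-annotated
    """
    B = set(A)
    for a in A:
        for g in ppi.get(a, ()):
            # keep g if either side of the interaction is pathway-annotated
            if in_pathway.get(a, False) or in_pathway.get(g, False):
                B.add(g)
    return B
```

On the worked example of Figure 5-3 (genes a, b, c with partners d, e, f, where neither f nor b is pathway-annotated), this rule returns {a, b, c, d, e}, excluding f.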
For example, consider a set of three genes, a, b and c, and their interaction partners, d, e and f (Figure 5-3). The interaction between genes f and b is not annotated in any KEGG pathway; therefore, f is not selected. The expanded set B then contains the genes a, b, c, d and e.

The key genes
Set B may be large, and it is possible that not all of its members are involved in a single pathway relevant to the biological phenotype. The next step, therefore, is to search for the genes that are most likely to interact with genes in A (Figure 5-4). We call these the key genes associated with the phenotype. Key genes may be members of A or genes that interact with members of A.

Figure 5-4: Identification of key genes. B is annotated using GO CC terms; a CC-term-based distance matrix M_B is created and used to cluster B into sub-groups B_1, B_2, ..., B_k; the sub-group B̂ closest to A under a GO BP-based similarity is selected; B_A is then extended using PPI, and the expression profiles S_A for B_A are created using a random walk and heuristics.

We cluster the genes in B into groups in which the genes most often occupy the same cellular location, to facilitate interaction. We use the GO Cellular Component (CC) annotations to measure the distances between genes in B and then apply our clustering methods to classify the genes in B into groups, B_1, B_2, ..., B_k. A subset may be the set A itself, or may have a member that interacts with at least one member of A and is located in the same cellular location as that member of A. Because these sets B_i, i = 1, ..., k, are not necessarily related to A in a biological process, we search for the B̂, B̂ ∈ {B_i}i=1..k, that is functionally closest to A. We address this problem using a semantic similarity measure based on GO Biological Process (BP) annotations. The selected set B̂ should have the highest semantic similarity to A.
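Selecting B̂ can be sketched as below, assuming some pairwise GO-BP semantic similarity function `bp_sim` in [0, 1] is available (any Resnik- or Lin-style measure could be plugged in); the average-over-pairs aggregation is an assumption made for this sketch, as the thesis does not fix the aggregation here.

```python
def pick_closest_cluster(A, clusters, bp_sim):
    """Choose B-hat: the cluster with the highest average GO-BP semantic
    similarity to the query set A.

    A        -- set of query genes
    clusters -- list of gene sets B_1, ..., B_k
    bp_sim   -- callable (gene, gene) -> similarity in [0, 1]
    """
    def score(cluster):
        pairs = [(g, h) for g in cluster for h in A]
        return sum(bp_sim(g, h) for g, h in pairs) / len(pairs)
    return max(clusters, key=score)
```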
We define B_A = A ∪ B̂, the key genes of the expanded set B of set A. Since the set B_A was generated by examining known interactors of A that occupy similar cellular locations and are annotated to pathways, B_A is expected to describe the biological phenotype of interest better than the set A.

Generate and apply the gene expression signature
In this step, we use B_A and the CMAP gene expression microarray data to generate a gene expression signature for the biological phenotype. Because not all the genes in B_A exhibit a change in all expression profiles in the dataset, the key issues in this step are to select expression profiles and to determine the direction of change in expression of the genes in B_A. We extract the fold change (FC) levels of the genes in B_A using all gene expression profiles in CMAP, compute p-values from FC to filter the genes differentially expressed in each expression profile using Rank Product [30], and conduct cluster analysis to group all CMAP expression profiles over the genes in B_A into p groups, {P_i}i=1..p. We choose P_A as the group of samples in which the genes in B_A are the most changed. We use the selected genes to define the gene expression signature, S_A, for the phenotype. The FC levels of the genes in S_A are averaged across the expression profiles in P_A to determine the expression direction. S_A is the gene expression signature we suggest for the phenotype.
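The direction-assignment step can be sketched as below; the variable names are hypothetical, and 'Up'/'Down' correspond to an average fold change above or below 1 across the selected profiles P_A, as described above.

```python
import numpy as np

def build_signature(fc, probe_ids):
    """Average fold change across the selected profiles (rows of `fc`)
    and label each probe 'Up' if the mean FC exceeds 1, else 'Down'.

    fc        -- 2-D array-like, profiles x probes, of fold-change values
    probe_ids -- probe identifiers, one per column of fc
    """
    mean_fc = np.asarray(fc).mean(axis=0)
    return {p: ('Up' if m > 1.0 else 'Down')
            for p, m in zip(probe_ids, mean_fc)}
```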
Figure 5-5: Gene expression signature generation and application. A set of expression profiles related to S_A, CMAP_A, is selected from CMAP and clustered into m gene clusters C_1, ..., C_m using gene expression levels; the cluster C_A closest to A by expression level is selected, and clusters of unchanged genes are eliminated. Result 1: a set of drugs/treatments that may target genes in A. Result 2: C_A provides additional genes related to the phenotype. Result 3: C^c_A provides additional gene groups related to the phenotype.

The predicted expression signature, S_A, is then used to identify a subset of CMAP, CMAP_A, which contains the expression profiles strongly relevant to the phenotype. The small molecules/drugs used as treatments in CMAP_A suggest potential therapeutics to modulate the expression of the given gene set. By conducting cluster analysis on CMAP_A and using the GO BP-based semantic similarity measure, we can also obtain the set C_A, which contains additional genes potentially related to the biological phenotype of interest, and the set C^c_A of additional groups of genes that changed, but differently than did the genes in the given set A.

5.2.2 Model testing
Intellectual disability (ID) in Down syndrome (DS) ranges from low normal to severely impaired, and has a significant impact on the quality of life of the affected individuals and their families. Development of pharmacotherapies for cognitive deficits is, therefore, an important goal. The working hypothesis in DS research is that the phenotypic features are caused by the modest 50% increases in expression of trisomic genes, and that these in turn result in modest perturbations of otherwise normal cell functions and pathways.
However, the primary difficulties in targeting therapeutics to Hsa21q genes are determining the number of genes to be regulated simultaneously, the number of drugs to be used, and the gene and/or drug interactions that might occur in full trisomies and produce adverse effects [171]. This application helps with the identification of the set of drug treatments that may affect a given set of Hsa21q genes using the CMAP datasets.

Hsa21 gene set 1
This gene set contains six Hsa21q genes: APP, DYRK1A, ITSN1, RCAN1, S100B and TIAM1 [171], as shown in Table 5.3.

1) As in Figure 5-3, to generate the expanded gene set, we used the PPI and pathway database in the Gardiner lab [167], which is compiled from multiple bioinformatics resources [168] (available online at http://gfuncpathdb.ucdenver.edu/iddrc/). We used the HPRD PPIs at the "in-vivo" level and the KEGG pathways to filter the genes. The expanded set B of A is shown in Table 5.4.

Table 5.3: Hsa21 set-1 query genes
1. APP, amyloid beta (A4) precursor protein, chr. 21
2. DYRK1A, dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1A, chr. 21
3. ITSN1, intersectin 1 (SH3 domain protein), chr. 21
4. RCAN1, regulator of calcineurin 1, chr. 21
5. S100B, S100 calcium binding protein B, chr. 21
6. TIAM1, T-cell lymphoma invasion and metastasis 1, chr. 21

2) As in Figure 5-4, we conduct cluster analysis on B using the fzGASCE and GO-fzBLE methods with GO-CC annotations. Gene groups are then ranked by how close they are to the given gene set A using the GO-BP based semantic similarity measure. Results are shown in Table 5.5. We choose the first group, which has the highest semantic similarity score to the set A, for the next steps.
Table 5.4: Expanded set of the Hsa21 set-1. Selected proteins in BOLD are annotated in at least one pathway.

1. APP: SERPINA3, ABL1, APBA1, APBA2, APBB1, APBB2, CALR, CASP3, CDK5, CHRNA7, CLU, CTSD, F12, GRB2, GSK3B, GSN, HSD17B10, HMOX2, KNG1, KLC1, LRP1, NF1, PIN1, PPID, PSEN2, SHC1, TGFB1, TGFB2, NUMB, NAE1, NUMBL, ITM2B, HOMER3, HOMER2, MAPK8IP1, APBA3, BCAP31, APBB3, KAT5, NCSTN, BACE2, LDLRAP1, HTRA2, LRP1B, SHC3, NECAB3, COL25A1
2. DYRK1A: CREB1, FOXO1, GLI1, YWHAB, YWHAE, YWHAG
3. ITSN1: DNM1, EPHB2, EPS15, HRAS, SNAP25, SOS1, SNAP23, WASL
4. RCAN1: MAP3K3, PPP3CA
5. S100B: GUCY2D, MAPT, IQGAP1
6. TIAM1: ANK1, CAMK2G, CD44, EFNB1, EPHA2, HRAS, MYC, NME1, PRKCA, RAC1, SRC, TIAM1, YWHAG, MAPK8IP1, MAPK8IP2, PARD3, PPP1R9B

Table 5.5: Gene clusters of the Hsa21 gene set-1 expanded set. Genes in BOLD are in the original gene set A; chromosome in parentheses.

Cluster 1 (similarity score to A: 0.32): APBA1(9), APBA2(15), APBB1(11), APBB2(4), APP(21), CHRNA7(15), CLU(8), CREB1(2), CTSD(11), DNM1(9), RCAN1(21), DYRK1A(21), EPHA2(1), EPHB2(1), KNG1(3), LRP1(12), MYC(8), NF1(17), PIN1(19), PPID(4), PSEN2(1), S100B(21), SNAP25(20), TGFB1(19), TGFB2(1), SNAP23(15), IQGAP1(15), NUMBL(19), HOMER3(19), KAT5(11), LRP1B(2), PPP1R9B(17)

Cluster 2 (similarity score to A: 0.26): SERPINA3(14), ABL1(9), ANK1(8), CALR(19), CAMK2G(10), CASP3(4), CD44(11), CDK5(7), EFNB1(X), EPS15(1), F12(5), FOXO1(13), GLI1(12), GRB2(17), GSK3B(3), GSN(9), GUCY2D(17), HSD17B10(X), HMOX2(16), HRAS(11), KLC1(14), MAPT(17), MAP3K3(17), NME1(17), PPP3CA(4), PRKCA(17), RAC1(7), ITSN1(21), SHC1(1), SOS1(2), SRC(20), TIAM1(21), YWHAB(20), YWHAE(17), YWHAG(7), NUMB(14), NAE1(16), WASL(7), ITM2B(13), HOMER2(15), MAPK8IP1(11), APBA3(19), BCAP31(X), APBB3(5), NCSTN(1), MAPK8IP2(22), BACE2(21), LDLRAP1(1), HTRA2(2), SHC3(9), PARD3(10), NECAB3(20), COL25A1(4)

3) As in Figure 5-5, we search for the probesets of the selected genes and extract their expression levels. Differential expression significance is set using a p-value cutoff of 0.005.
We conduct cluster analysis of the expression profiles using our clustering method fzGASCE, with the fuzzifier factor m set to 1.15. The expression profile clusters are ranked based on the average fold change of the genes that are differentially expressed. The cluster with the highest expression level is selected.

4) We average the expression levels of the selected probesets (S_A) across the selected profiles and assign the 'Down' label to probesets whose averaged FC value is less than 1, and the 'Up' label otherwise. Results are shown in Table 5.6. We propose these genes, with their direction of expression change, as the expression signature for the phenotype most related to the given set A of Hsa21 genes.

5) We use the gene expression signature in Table 5.6 with RPPR [169] (a support tool is available online [168]) to select the list of drug treatments in CMAP using a score cutoff of 0.8. Results are shown in Tables 5.7 and 5.8. Our approach successfully identifies drugs commonly used in the treatment of mental illness. Mebhydrolin is used to treat depression in older adults with ID. Cytochalasin B causes filament shortening; however, Krucker et al. [186] showed that the effect of cytochalasin B can be reversible, making stable long-term potentiation in region CA1 of the hippocampus possible. Geldanamycin can help with reducing brain injury, and levomepromazine is used to relieve symptoms of schizophrenia. All of these drugs are listed in Table 5.8.

Table 5.6: Proposed gene expression signature for the Hsa21 gene set-1 (67 genes, 147 probesets; laid out in two columns)
No Probeset Sign Gene Chr. No Probeset Sign Gene Chr.
1 202123 s at i ABL1 9 75 206689 x at i KAT5 11 2 207087 x at i ANK1 8 76 209192 x at i KAT5 11 3 208352 x at i ANK1 8 77 214258 x at i KAT5 11 4 205389_s_at i ANK1 8 78 212877_at i KLC1 14 5 208353 x at i ANK1 8 79 212878 s at i KLC1 14 6 205391 x at i ANK1 8 80 213656 s at i KLC1 14 7 205390 s at i ANK1 8 81 217512 at i KNG1 3 8 206679 at i APBA1 9 82 206054 at i KNG1 3 9 215148 s at i APBA3 19 83 57082 at i LDLRAP1 1 10 205146 x at i APBA3 19 84 221790 s at i LDLRAP1 1 11 202652 at i APBB1 11 85 203514 at i MAP3K3 17 12 204650 s at i APBB3 5 86 213013 at i MAPK8IP1 11 13 211277 x at i APP 21 87 213014 at i MAPK8IP1 11 14 200602 at i APP 21 88 205050 s at i MAPK8IP2 22 15 214953 s at i APP 21 89 208603 s at i MAPK8IP2 22 16 217867 x at i BACE2 21 90 20640 l_s_at i MAPT 17 17 212952 at i CALR 19 91 203928 x at i MAPT 17 18 212953 x at i CALR 19 92 203929 s at i MAPT 17 19 214315 x at i CALR 19 93 203930 s at i MAPT 17 20 214316 x at i CALR 19 94 20243 l_s_at i MYC 8 21 200935 at i CALR 19 95 202268 s at i NAE1 16 22 212669 at i CAMK2G 10 96 208759 at i NCSTN 1 23 214322 at i CAMK2G 10 97 210720 s at i NECAB3 20 24 212757 s at i CAMK2G 10 98 207545 s at i NUMB 14 25 202763 at i CASP3 4 99 209073 s at i NUMB 14 26 204490 s at i CD44 11 100 221280_s at i PARD3 10 27 212014_x_at i CD44 11 101 210094_s_at i PARD3 10 28 209835 x at i CD44 11 102 221526 x at i PARD3 10 29 210916 s at i CD44 11 103 221527 s at i PARD3 10 30 212063 at i CD44 11 104 202927_at i PIN1 19 31 217523 at i CD44 11 105 204186 s at i PPID 4 32 216056 at i CD44 11 106 204185 x at i PPID 4 33 204489 s at i CD44 11 107 202429 s at i PPP3CA 4 34 204247 s at i CDK5 7 108 202457 s at i PPP3CA 4 35 210123 s at i CHRNA7 15 109 202425 x at i PPP3CA 4 36 208791 at i CLU 8 110 215194 at i PRKCA 17 37 208792 s at i CLU 8 111 213093 at i PRKCA 17 38 222043 at i CLU 8 112 215195 at i PRKCA 17 39 214513 s at i CREB1 2 113 206923 at i PRKCA 17 40 204312 x at i CREB1 2 114 211373_s_at i PSEN2 1 41 204314 s at i 
CREB1 2 115 204262 s at i PSEN2 1 42 204313 s at i CREB1 2 116 20426 lsat i PSEN2 1 43 200766 at i CTSD 11 117 208641 s at i RAC1 7 44 209033 s at i DYRK1A 21 118 209686 at i S100B 21 209 Table 5.6 (con't.) No Probeset Sign Gene Chr. No Probeset Sign Gene Chr. 45 211541 s at i DYRK1A 21 119 202376 at i SERPINA3 14 46 211079 s at i DYRK1A 21 120 214853 s at i SHC1 1 47 203499 at i EPHA2 1 121 201469 s at i SHC1 1 48 217887 s at i EPS 15 1 122 217048 at i SHC1 1 49 217886 at i EPS 15 1 123 206330 s at i SHC3 9 50 215961 at i F12 5 124 209131 s at i SNAP23 15 51 205774 at i F12 5 125 209130 at i SNAP23 15 52 202724 s at i FOXOl 13 126 214544 s at i SNAP23 15 53 202723 s at i FOXOl 13 127 202508 s at i SNAP25 20 54 206646 at i GLI1 12 128 202507 s at i SNAP25 20 55 215075 s at i GRB2 17 129 212777 at i SOS1 2 56 209945 s at i GSK3B 3 130 212780 at i SOS1 2 57 200696 s at i GSN 9 131 221284 s at i SRC 20 58 214040 s at i GSN 9 132 221281 at i SRC 20 59 218120 s at i HMOX2 16 133 213324 at i SRC 20 60 218121 at i HMOX2 16 134 203085 s at i TGFB1 19 61 217080 s at i HOMER2 15 135 203084 at i TGFB1 19 62 212983 at i HRAS 11 136 209908 s at i TGFB2 1 63 203089 s at i HTRA2 2 137 220407 s at i TGFB2 1 64 211152 s at i HTRA2 2 138 209909 s at i TGFB2 1 65 213446 s at i IQGAP1 15 139 206409 at i TIAM1 21 66 210840 s at i IQGAP1 15 140 213135 at i TIAM1 21 67 200791 s at i IQGAP1 15 141 205810 s at i WASL 7 68 217731 s at i ITM2B 13 142 205809 s at i WASL 7 69 217732 s at i ITM2B 13 143 217717 s at i YWHAB 20 70 209297 at i ITSN1 21 144 208743 s at i YWHAB 20 71 35776 at i ITSN1 21 145 210317 s at i YWHAE 17 72 207322 at i ITSN1 21 146 210996 s at i YWHAE 17 73 210713 at i ITSN1 21 147 213655 at i YWHAE 17 74 209298 s at i ITSN1 21 Table 5.7 : Drugs predicted to enhance expression of Hsa21 gene set-1 Drug (concentration) Target or disease Estradiol (lOnM) Treatment of urogenital symptoms associated with post- menopausal atrophy of the vagina and/or the lower urinary tract. 
4,5-dianilinophthalimide (10 mM) | Potent inhibitor of amyloid β fibrillization.
Colforsin (50 mM) | Novel drug for the treatment of acute heart failure.
Fisetin (50 mM) | A compound that has been shown to alleviate ageing effects in certain model organisms.

Table 5.8: Drugs predicted to repress expression of Hsa21 gene set-1

Drug (concentration) | Target or disease
Mebhydrolin (4.8 mM) | Enhanced sedation with alcohol or other CNS depressants. Additive antimuscarinic effects with MAOIs, atropine and TCAs, which are used in the treatment of major depressive disorder, panic disorder and other anxiety disorders, increasing levels of norepinephrine and serotonin, which act as communication agents between different brain cells.
Cytochalasin B (20.8 mM) | A cell-permeable mycotoxin. It inhibits cytoplasmic division by blocking the formation of contractile microfilaments. It inhibits cell movement and induces nuclear extrusion.
Geldanamycin (1.0 mM) | A benzoquinone ansamycin antibiotic that binds to Hsp90 and inhibits its function. HSP90 proteins play important roles in the regulation of the cell cycle, cell growth, cell survival, apoptosis, angiogenesis and oncogenesis.
Tanespimycin (1.0 mM) | It helps cause the breakdown of certain proteins in the cell, and may kill cancer cells. It is a type of antineoplastic antibiotic and a type of HSP90 inhibitor.
Vorinostat (10.0 mM) | A drug used to treat cutaneous T-cell lymphoma that does not get better, gets worse, or comes back during or after treatment with other drugs. It is also being studied in the treatment of other types of cancer. Vorinostat is a type of histone deacetylase inhibitor.
Trichostatin A (1.0 mM, 100.0 nM) | An organic compound that serves as an antifungal antibiotic and selectively inhibits the class I and II mammalian histone deacetylase families of enzymes, but not class III HDACs.

Table 5.8 (con't.)
Drug (concentration) | Target or disease
Levomepromazine (9.0 mM) | Levomepromazine is used in palliative care to help ease distressing symptoms such as pain, restlessness, anxiety and sickness. It works on chemical substances acting on the nervous system in the brain. Levomepromazine is also used to relieve symptoms of schizophrenia, a mental health condition that causes disordered ideas, beliefs and experiences. Symptoms of schizophrenia include hearing, seeing, or sensing things that are not real, having mistaken beliefs, and feeling unusually suspicious.

Hsa21 gene set 2

Given a set of Hsa21 genes: APP, DSCAM, ITSN1, PCP4, RCAN1, TIAM1.

Table 5.9: Hsa21 set-2 query genes

No. Gene Description Chr.
1 APP amyloid beta (A4) precursor protein 21
2 DSCAM Down syndrome cell adhesion molecule 21
3 ITSN1 intersectin 1 (SH3 domain protein) 21
4 PCP4 Purkinje cell protein 4 21
5 RCAN1 regulator of calcineurin 1 21
6 TIAM1 T-cell lymphoma invasion and metastasis 1 21

Table 5.10 shows the predicted gene expression signature for Hsa21 gene set 2. We apply the gene expression signature to search for drug treatments that may affect genes in the given gene set using the CMAP datasets. Drug treatments predicted to positively regulate genes in the given set are listed in Table 5.11, and Table 5.12 lists drugs predicted to negatively affect these genes. The latter include geldanamycin, which can help reduce brain injury; podophyllotoxin, which is used in the treatment of brain tumors; and chlorpromazine, which is used in the treatment of various psychiatric illnesses.

Table 5.10: Predicted gene expression signature for the Hsa21 gene set-2 (70 genes, 157 probesets)

No. AffyID Sign Gene Chr. | No. AffyID Sign Gene Chr.
1 202123_s_at i ABL1 9 | 80 210713_at i ITSN1 21
2 205390_s_at i ANK1 8 | 81 209192_x_at i KAT5 11
3 208352_x_at i ANK1 8 | 82 214258_x_at i KAT5 11
4 207087_x_at i ANK1 8 | 83 206689_x_at i KAT5 11
5 205389_s_at i ANK1 8 | 84 212877_at i KLC1 14
6 205391_x_at i ANK1 8 | 85 212878_s_at i KLC1 14
7 208353_x_at i ANK1 8 | 86 213656_s_at i KLC1 14
8 206679_at i APBA1 9 | 87 206054_at i KNG1 3
9 209870_s_at i APBA2 15 | 88 217512_at i KNG1 3
10 209871_s_at i APBA2 15 | 89 57082_at i LDLRAP1 1
11 215148_s_at i APBA3 19 | 90 221790_s_at i LDLRAP1 1
12 205146_x_at i APBA3 19 | 91 200784_s_at i LRP1 12
13 202652_at i APBB1 11 | 92 200785_s_at i LRP1 12
14 212972_x_at i APBB2 4 | 93 203514_at i MAP3K3 17
15 213419_at i APBB2 4 | 94 213013_at i MAPK8IP1 11
16 216750_at i APBB2 4 | 95 213014_at i MAPK8IP1 11
17 216747_at i APBB2 4 | 96 208603_s_at i MAPK8IP2 22
18 40148_at i APBB2 4 | 97 205050_s_at i MAPK8IP2 22
19 212970_at i APBB2 4 | 98 202431_s_at i MYC 8
20 212985_at i APBB2 4 | 99 202268_s_at i NAE1 16
21 204650_s_at i APBB3 5 | 100 208759_at i NCSTN 1
22 200602_at i APP 21 | 101 210720_s_at i NECAB3 20
23 214953_s_at i APP 21 | 102 211914_x_at i NF1 17
24 211277_x_at i APP 21 | 103 212676_at i NF1 17
25 217867_x_at i BACE2 21 | 104 216115_at i NF1 17
26 200837_at i BCAP31 X | 105 204325_s_at i NF1 17
27 200935_at i CALR 19 | 106 211094_s_at i NF1 17
28 212952_at i CALR 19 | 107 212678_at i NF1 17
29 214315_x_at i CALR 19 | 108 210631_at i NF1 17
30 212953_x_at i CALR 19 | 109 204323_x_at i NF1 17
31 214316_x_at i CALR 19 | 110 211095_at i NF1 17
32 214322_at i CAMK2G 10 | 111 207545_s_at i NUMB 14
33 212669_at i CAMK2G 10 | 112 209073_s_at i NUMB 14
34 212757_s_at i CAMK2G 10 | 113 209615_s_at i PAK1 11
35 202763_at i CASP3 4 | 114 221280_s_at i PARD3 10
36 212063_at i CD44 11 | 115 221527_s_at i PARD3 10
37 204489_s_at i CD44 11 | 116 221526_x_at i PARD3 10
38 210916_s_at i CD44 11 | 117 210094_s_at i PARD3 10

Table 5.10 (con't.)

No. AffyID Sign Gene Chr. | No. AffyID Sign Gene Chr.
39 204490_s_at i CD44 11 | 118 202927_at i PIN1 19
40 212014_x_at i CD44 11 | 119 204186_s_at i PPID 4
41 217523_at i CD44 11 | 120 204185_x_at i PPID 4
42 209835_x_at i CD44 11 | 121 202429_s_at i PPP3CA 4
43 216056_at i CD44 11 | 122 202425_x_at i PPP3CA 4
44 204247_s_at i CDK5 7 | 123 202457_s_at i PPP3CA 4
45 210123_s_at i CHRNA7 15 | 124 215195_at i PRKCA 17
46 208792_s_at i CLU 8 | 125 215194_at i PRKCA 17
47 208791_at i CLU 8 | 126 213093_at i PRKCA 17
48 222043_at i CLU 8 | 127 206923_at i PRKCA 17
49 200766_at i CTSD 11 | 128 204261_s_at i PSEN2 1
50 217341_at i DNM1 9 | 129 204262_s_at i PSEN2 1
51 215116_s_at i DNM1 9 | 130 211373_s_at i PSEN2 1
52 211484_s_at i DSCAM 21 | 131 208641_s_at i RAC1 7
53 202711_at i EFNB1 X | 132 208370_s_at i RCAN1 21
54 203499_at i EPHA2 1 | 133 215253_s_at i RCAN1 21
55 210651_s_at i EPHB2 1 | 134 202376_at i SERPINA3 14
56 211165_x_at i EPHB2 1 | 135 201469_s_at i SHC1 1
57 209589_s_at i EPHB2 1 | 136 214853_s_at i SHC1 1
58 209588_at i EPHB2 1 | 137 206330_s_at i SHC3 9
59 217887_s_at i EPS15 1 | 138 214544_s_at i SNAP23 15
60 217886_at i EPS15 1 | 139 209130_at i SNAP23 15
61 215961_at i F12 5 | 140 209131_s_at i SNAP23 15
62 205774_at i F12 5 | 141 202508_s_at i SNAP25 20
63 215075_s_at i GRB2 17 | 142 202507_s_at i SNAP25 20
64 209945_s_at i GSK3B 3 | 143 212777_at i SOS1 2
65 200696_s_at i GSN 9 | 144 212780_at i SOS1 2
66 214040_s_at i GSN 9 | 145 221281_at i SRC 20
67 218121_at i HMOX2 16 | 146 213324_at i SRC 20
68 218120_s_at i HMOX2 16 | 147 221284_s_at i SRC 20
69 217080_s_at i HOMER2 15 | 148 203084_at i TGFB1 19
70 212983_at i HRAS 11 | 149 203085_s_at i TGFB1 19
71 202282_at i HSD17B10 X | 150 220407_s_at i TGFB2 1
72 211152_s_at i HTRA2 2 | 151 209908_s_at i TGFB2 1
73 203089_s_at i HTRA2 2 | 152 220406_at i TGFB2 1
74 217731_s_at i ITM2B 13 | 153 209909_s_at i TGFB2 1
75 217732_s_at i ITM2B 13 | 154 213135_at i TIAM1 21
76 207322_at i ITSN1 21 | 155 206409_at i TIAM1 21
77 209298_s_at i ITSN1 21 | 156 205810_s_at i WASL 7
78 209297_at i ITSN1 21 | 157 205809_s_at i WASL 7
79 35776_at i ITSN1 21

Table 5.11: Drugs predicted to enhance expression of Hsa21 gene set-2

Drug (concentration) | Target or disease
Estradiol (10 nM) | Treatment of urogenital symptoms associated with post-menopausal atrophy of the vagina and/or the lower urinary tract.
4,5-dianilinophthalimide (10 mM) | Potent inhibitor of amyloid β fibrillization.
Colforsin (50 mM) | Novel drug for the treatment of acute heart failure.

Table 5.12: Drugs predicted to repress expression of Hsa21 gene set-2

Drug (concentration) | Target or disease
Trichostatin A (1.0 mM, 100.0 nM) | An organic compound that serves as an antifungal antibiotic and selectively inhibits the class I and II mammalian histone deacetylase families of enzymes, but not class III HDACs.
Geldanamycin (1.0 mM) | A benzoquinone ansamycin antibiotic that binds to Hsp90 and inhibits its function. HSP90 proteins play important roles in the regulation of the cell cycle, cell growth, cell survival, apoptosis, angiogenesis and oncogenesis.
Podophyllotoxin (9.6 mM) | Used in the treatment of many cancers, particularly small cell lung carcinoma, testicular cancer and brain tumors. It arrests cell growth by inhibiting DNA topoisomerase II, which causes double-strand breaks in DNA.
Cytochalasin B (20.8 mM) | A cell-permeable mycotoxin. It inhibits cytoplasmic division by blocking the formation of contractile microfilaments. It inhibits cell movement and induces nuclear extrusion.

Table 5.12 (con't.)

Drug (concentration) | Target or disease
Propafenone (10.6 mM) | Used to prevent or treat serious irregular heartbeats (arrhythmia). Belongs to a class of drugs called antiarrhythmics.
Chlorpromazine (11.2 mM) | Used in the treatment of various psychiatric illnesses, and also used in the management of nausea and vomiting associated with terminal illness.
Tanespimycin (1.0 mM) | It helps cause the breakdown of certain proteins in the cell, and may kill cancer cells. It is a type of antineoplastic antibiotic and a type of HSP90 inhibitor.
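The drug predictions in the tables above come from matching an up/down gene expression signature against rank-ordered CMAP expression profiles [25, 26]. A Kolmogorov-Smirnov-style connectivity score in the spirit of Lamb et al. can be sketched as below; this is a minimal illustration under simplified assumptions, with hypothetical function names and data, not the thesis implementation:

```python
# Illustrative sketch of a KS-style CMAP connectivity score (after Lamb
# et al., 2006).  `tags` are signature probesets; `ranked_list` is a
# drug profile's probesets ordered from most up- to most down-regulated.
# All names and data here are hypothetical, not the thesis code.

def ks_score(tags, ranked_list):
    """One-sided KS-style enrichment of `tags` within `ranked_list`."""
    n, t = len(ranked_list), len(tags)
    positions = sorted(ranked_list.index(p) + 1 for p in tags)
    a = max(j / t - positions[j - 1] / n for j in range(1, t + 1))
    b = max(positions[j - 1] / n - (j - 1) / t for j in range(1, t + 1))
    return a if a > b else -b

def connectivity(up_tags, down_tags, ranked_list):
    """Positive when up-tags sit near the top and down-tags near the
    bottom of the profile; zero when both are enriched the same way."""
    ks_up = ks_score(up_tags, ranked_list)
    ks_down = ks_score(down_tags, ranked_list)
    return 0.0 if (ks_up > 0) == (ks_down > 0) else ks_up - ks_down
```

Profiles are then ranked by this score: large positive scores suggest drugs that enhance the signature (Tables 5.7, 5.11), large negative scores drugs that repress it (Tables 5.8, 5.12).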
5.3 Drug target prediction

In this application, we search for the set of genes affected by a given set of drug treatments as follows:

1) Relate the given set of drug treatments to drug treatments in the CMAP:
   a. using DrugBank chemical similarity;
   b. based on overlap of drug targets.
2) Extract the expression profiles of the CMAP drug treatments related to the given drug treatments.
3) Generate the gene expression signature as in Section 5.2.1.
4) Apply RPPR [169] to the gene expression signature to expand the selected expression profiles.
5) Apply our clustering methods to search for the group of genes that are most differentially expressed. This set of genes is the answer to the problem.

5.4 Application of HIGEDA to the prediction of regulatory motifs

HIGEDA is capable of discovering new motifs in a sequence set. In this application, we use HIGEDA to discover transcription factor binding sites (TFBS) and miRNA complementary binding sites (MBS) in the genomic and mRNA sequences of genes within each cluster from Section 5.1. Figure 5-6 illustrates the proposed method.

[Figure 5-6 (diagram): integrated gene expression data and the biological context of interest feed a cluster analysis; the resulting gene groups are mapped to gene RefSeqs from the NCBI database, which the motif scanner and HIGEDA use to predict TFBS and MBS.]

Figure 5-6: Transcription factor and microRNA binding site prediction using HIGEDA

6. Conclusions and future work

Our studies of microarray analysis included meta-analysis methods, cluster analysis of gene expression data, and motif finding in biological sequences. Use of meta-analysis methods should provide better results in the analysis of microarray data obtained from multiple studies because of a stronger statistical basis and a reduction in the bias caused by the results of any individual study.
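One widely used statistic behind such rank-based meta-analysis is the rank product [29, 30], which combines a gene's per-study ranks by their geometric mean. The sketch below is a minimal illustrative Python version of that statistic only, not the thesis implementation:

```python
# Minimal sketch of the rank-product statistic (Breitling et al., 2004)
# for combining evidence across k studies; illustrative only.
import math

def rank_product(fold_changes_by_study):
    """fold_changes_by_study: one dict {gene: fold change} per study.
    Returns {gene: geometric mean of its ranks}; a smaller value means
    the gene is more consistently up-regulated across studies."""
    k = len(fold_changes_by_study)
    ranks = {}
    for study in fold_changes_by_study:
        # rank 1 = largest fold change (most up-regulated) in this study
        ordered = sorted(study, key=study.get, reverse=True)
        for r, gene in enumerate(ordered, start=1):
            ranks.setdefault(gene, []).append(r)
    return {g: math.prod(rs) ** (1.0 / k) for g, rs in ranks.items()}
```

In practice the statistic is paired with a permutation procedure to assign significance, as in the RankProd package [30].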
In cluster analysis, we showed that soft clustering methods are more appropriate for gene expression data because they allow genes to belong to multiple clusters, reflecting their participation in more than one biological process. Use of the FCM algorithm is optimal because it avoids two problems seen in the EM algorithm: slow convergence in regions of overlap between adjacent clusters, and an assumed data distribution model that most microarray data violate, since such data do not follow a specific distribution model such as Gaussian or chi-squared. We also proved that, by using the possibility-to-probability model conversion, we can apply probability-based model selection to possibility-based models. We therefore benefit from the power of statistical methods while still effectively modeling the natural data distributions, thanks to the flexibility of possibility-based models. The advantages of our approach have been shown in our novel methods for fuzzy cluster validation using either gene expression data or GO annotations, missing data imputation, fuzzy cluster defuzzification and cluster number detection.

In the motif-finding problem, we showed that a PWM-based model is most appropriate. With the use of the hierarchical genetic algorithm, where patterns of subsequences in motifs can change simultaneously, the motif-finding algorithm can avoid local optima. In addition, with the use of the dynamic programming algorithm, gapped motifs can be discovered.

In this dissertation, we have made the following contributions to bioinformatics research:

1) Development of a meta-analysis method to effectively filter differentially expressed genes using external metrics, information based on the specific biological context, and prior knowledge of existing biological pathways, PPIs and the Gene Ontology.
The method is applicable to cases where data come from different laboratories using different versions of different microarray platforms, and should provide more biologically relevant results. We further propose a novel use of the Connectivity Map datasets to predict gene-drug associations.

2) Development of a novel clustering algorithm that combines algorithms developed for FCM parameter initialization, FCM missing data imputation, defuzzification of the FCM fuzzy partition and fuzzy cluster evaluation with the optimization algorithm, GA. The algorithm is applicable to cluster analysis of gene expression data without prior specification of the number of clusters or assumptions about the data distributions. Our clustering method successfully supported the specific applications described in Section 5.

3) Development of a novel motif-finding algorithm for the problem of gene regulatory sequence prediction for groups of genes with similar expression patterns under biological conditions of interest, demonstrated in Section 4.3.2.

4) Development of a machine learning based framework for a complete solution to the problem of microarray data analysis.

5) Development of a website [168] where the methods of this dissertation are made available online. The website uses R software as the computational back-end so that our methods can work seamlessly with other statistical, machine learning and data mining methods available in R packages. We will update our website with our machine learning based framework to allow microarray data analysis online.

In future work, we will improve our algorithms with parallelization and implement them as web applications to support gene-drug-disease association prediction. These applications will run on High-Performance Computing (HPC) systems and be accessible online.
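At the core of the clustering contribution above is the standard fuzzy C-means update. The sketch below is a minimal pure-Python illustration of that single step only; it omits the GA-based initialization, missing-data imputation, validation and defuzzification components developed in the dissertation:

```python
# One iteration of fuzzy C-means (FCM), for illustration only.
# points: list of d-dimensional tuples; centers: list of d-dim tuples;
# m > 1 is the fuzzifier.  Returns (memberships U, new centers).

def fcm_step(points, centers, m=2.0):
    eps = 1e-12  # guard against zero distances

    def dist(a, b):
        return max(eps, sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5)

    # membership of each point in each cluster (rows sum to 1)
    U = []
    for p in points:
        inv = [dist(p, c) ** (-2.0 / (m - 1.0)) for c in centers]
        s = sum(inv)
        U.append([v / s for v in inv])

    # fuzzified (membership-weighted) centroid update
    new_centers = []
    for j in range(len(centers)):
        w = [U[i][j] ** m for i in range(len(points))]
        total = sum(w)
        new_centers.append(tuple(
            sum(w[i] * points[i][k] for i in range(len(points))) / total
            for k in range(len(points[0]))))
    return U, new_centers
```

Iterating this step to convergence yields the fuzzy partition; genes with substantial membership in several clusters are exactly the multi-process genes that motivate soft clustering here.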
In addition, we will adapt our methods to other research domains, such as Magnetic Resonance Imaging (MRI) data analysis, social network information retrieval, and cluster analysis of internet websites and user data. For MRI data analysis, our clustering methods can help with image segmentation, marker identification and cluster analysis. For social network information retrieval, our clustering methods can help identify groups of users with similar social behaviors, for example in entertainment, buying and travelling.

REFERENCES

[1] Stekel D. (2003) Microarray Bioinformatics, Cambridge University Press.
[2] Kohane I.S., Kho A.T. and Butte A.J. (2003) Microarrays for an Integrative Genomics, MIT Press.
[3] Amaratunga D. and Cabrera J. (2004) Exploration and Analysis of DNA Microarray and Protein Array Data, John Wiley & Sons, Inc., Hoboken, New Jersey.
[4] Ewens W. and Grant G. (2005) Statistical Methods in Bioinformatics: An Introduction, 2nd ed., Springer.
[5] Shewhart W.A. and Wilks S.S. (2009) Batch Effects and Noise in Microarray Experiments, Wiley.
[6] Barrett T., Suzek T.O., Troup D.B., Wilhite S.E., Ngau W.C., Ledoux P., Rudnev D., Lash A.E., Fujibuchi W. and Edgar R. (2005) NCBI GEO: mining millions of expression profiles - database and tools. Nucleic Acids Res., Vol. 33, D562-D566. doi:10.1093/nar/gki022.
[7] Barrett T., Troup D.B., Wilhite S.E., Ledoux P., Rudnev D., Evangelista C., Kim I.F., Soboleva A., Tomashevsky M. and Edgar R. (2007) NCBI GEO: mining tens of millions of expression profiles - database and tools update. Nucleic Acids Res., Vol. 35, D760-D765. doi:10.1093/nar/gkl887.
[8] Koziol A.J. (2011) Comments on the rank product method for analyzing replicated experiments. FEBS Letters, Elsevier, Vol. 584, pp. 941-944.
[9] Witten D. and Tibshirani R. (2007) A comparison of fold change and the t-statistic for microarray data analysis. Technical report, Stanford University.
[10] Deng X., Xu J., Hui J. and Wang C.
(2009) Probability fold change: A robust computational approach for identifying differentially expressed gene lists. Computer Methods and Programs in Biomedicine, Elsevier, Vol. 93, pp. 124-139.
[11] Warnat P., Eils R. and Brors B. (2005) Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, Vol. 6, pp. 265.
[12] Xu L., Geman D. and Winslow R. (2007) Large-scale integration of cancer microarray data identifies a robust common cancer signature. BMC Bioinformatics, Vol. 8, pp. 275.
[13] Xu L., Tan A.C., Winslow R.L. and Geman D. (2008) Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics, Vol. 9, pp. 125.
[14] Ramaswamy S., Ross K.N., Lander E.S. and Golub T.R. (2003) A molecular signature of metastasis in primary solid tumors. Nat Genet, Vol. 33, pp. 49-54.
[15] Bloom G., Yang I.V., Boulware D., Kwong K.Y., Coppola D., Eschrich S., Quackenbush J. and Yeatman T.J. (2004) Multi-Platform, Multi-Site, Microarray-Based Human Tumor Classification. Am J Pathol, Vol. 164, pp. 9-16.
[16] Shen R., Ghosh D. and Chinnaiyan A.M. (2004) Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics, Vol. 5, pp. 94.
[17] Jiang H., Deng Y., Chen H.S., Tao L., Sha Q., Chen J., Tsai C.J. and Zhang S. (2004) Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics, Vol. 5, pp. 81.
[18] Warnat P., Eils R. and Brors B. (2005) Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, pp. 265.
[19] Rao Y., Lee Y., Jarjoura D., Ruppert A.S., Liu C.G., Hsu J.C. and Hagan J.P. (2008) A Comparison of Normalization Techniques for Microarray Data. Berkeley Electronic Press, Vol. 7.
[20] Vert G., Nemhauser J.L., Geldner N., Hong F. and Chory J.
(2005) Molecular Mechanisms of Steroid Hormone Signaling in Plants. The Annual Review of Cell and Developmental Biology.
[21] Cahan P., Rovegno F., Mooney D., Newman J.C., Laurent III G. and McCaffrey T. (2007) Meta-analysis of microarray results: challenges, opportunities, and recommendations for standardization. Gene, Elsevier, pp. 12-18.
[22] Irizarry R.A., Warren D., Spencer F., Kim I.F., Biswal S., Frank B.C., Gabrielson E., Garcia J.G., Geoghegan J., Germino G., Griffin C., Hilmer S.C., Hoffman E., Jedlicka A.E., Kawasaki E., Martinez-Murillo F., Morsberger L., Lee H., Petersen D., Quackenbush J., Scott A., Wilson M., Yang Y., Ye S.Q. and Yu W. (2005) Multiple-laboratory comparison of microarray platforms. Nat Methods, Vol. 2, pp. 239-243.
[23] Ramasamy A., Mondry A., Holmes C.C. and Altman D.G. (2008) Key Issues in Conducting a Meta-Analysis of Gene Expression Microarray Datasets. PLoS, Vol. 5, pp. 184-196.
[24] Warnat P., Eils R. and Brors B. (2006) Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, Vol. 6, pp. 265.
[25] Lamb J., Crawford E.D., Peck D., Modell J.W., Blat I.C., Wrobel M.J., Lerner J., Brunet J.P., Subramanian A., Ross K.N., Reich M., Hieronymus H., Wei G., Armstrong S.A., Haggarty S.J., Clemons P.A., Wei R., Carr S.A., Lander E.S. and Golub T.R. (2006) The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science, Vol. 313, pp. 1929-1935.
[26] Zhang S.D. and Gant T.W. (2008) A simple and robust method for connecting small-molecule drugs using gene-expression signatures. BMC Bioinformatics, Vol. 9, pp. 258-264.
[27] Xu L. and Jordan M.I. (1996) On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, Vol. 8, pp. 129-151.
[28] Ma S. and Dai Y. (2011) Principal component analysis based methods in bioinformatics studies. Bioinformatics, Vol. 10, pp. 1-9.
[29] Breitling R., Armengaud P., Amtmann A. and Herzyk P. (2004) Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters, Vol. 573, pp. 83-92.
[30] Hong F., Breitling R., McEntee C.W., Wittner B.S., Nemhauser J.L. and Chory J. (2006) RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics, Vol. 22, pp. 2825-2827.
[31] Zhang N.R., Wildermuth M.C. and Speed T.P. (2008) Transcription factor binding site prediction with multivariate gene expression data. Annals of Applied Statistics, Vol. 2, pp. 332-365.
[32] Alexiou P., Maragkakis M., Papadopoulos G.L., Simmosis V.A., Zhang L. and Hatzigeorgiou A.G. (2010) The DIANA-mirExTra Web Server: From Gene Expression Data to MicroRNA Function. PLoS ONE, Vol. 5, e9172.
[33] Das M.K. and Dai H.K. (2007) A survey of DNA motif finding algorithms. BMC Bioinformatics, Vol. 8, S21.
[34] Bailey T.L. and Elkan C. (1995) The value of prior knowledge in discovering motifs with MEME. Proc. Intl. Conf. Intel. Syst. Mol. Biol., Vol. 3, pp. 21-29.
[35] Kel A.E., Kel-Margoulis O.V., Farnham P.J., Bartley S.M., Wingender E. and Zhang M.Q. (2001) Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors. Journal of Molecular Biology, Vol. 25, pp. 99-120.
[36] Rhodes D.R., Barrette T.R., Rubin M.A., Ghosh D. and Chinnaiyan A.M. (2002) Meta-Analysis of Microarrays: Interstudy Validation of Gene Expression Profiles Reveals Pathway Dysregulation in Prostate Cancer. Cancer Res, Vol. 62, pp. 4427-4433.
[37] Choi J.K., Yu U., Kim S. and Yoo O.J. (2003) Combining multiple microarray studies and modeling interstudy variation. Bioinformatics, Vol. 19, pp. 84-90.
[38] Choi J.K., Choi J.Y., Kim D.G., Choi D.W., Kim B.Y., Lee K.H., Yeom Y.I., Yoo H.S., Yoo O.J. and Kim S. (2004) Integrative analysis of multiple gene expression profiles applied to liver cancer study. FEBS Letters, Vol.
565, pp. 93-100.
[39] Hu P., Greenwood C.M. and Beyene J. (2005) Integrative analysis of multiple gene expression profiles with quality-adjusted effect size models. BMC Bioinformatics, Vol. 6, pp. 128.
[40] Marot G. and Mayer C.D. (2009) Sequential Analysis for Microarray Data Based on Sensitivity and Meta-Analysis. Berkeley Electronic Press, Vol. 8.
[41] Strimmer K. and Opgen-Rhein R. (2007) Accurate Ranking of Differentially Expressed Genes by a Distribution-Free Shrinkage Approach. Berkeley Electronic Press, Vol. 6.
[42] Lin S. (2010) Space Oriented Rank-Based Data Integration. Berkeley Electronic Press, Vol. 9.
[43] Parkhomenko E., Tritchler D. and Beyene J. (2009) Sparse Canonical Correlation Analysis with Application to Genomic Data Integration. Berkeley Electronic Press, Vol. 8.
[44] Su Z., Hong H., Fang H., Shi L., Perkins R. and Tong W. (2008) Very Important Pool (VIP) genes - an application for microarray-based molecular signatures. BMC Bioinformatics, Vol. 9, S9.
[45] Rossell D., Guerra R. and Scott C. (2008) Semi-Parametric Differential Expression Analysis via Partial Mixture Estimation. Berkeley Electronic Press, Vol. 7.
[46] Zhang S. (2006) An Improved Nonparametric Approach for Detecting Differentially Expressed Genes with Replicated Microarray Data. Berkeley Electronic Press, Vol. 5.
[47] Tan F., Fu X., Zhang Y. and Bourgeois A.G. (2006) Improving Feature Subset Selection Using a Genetic Algorithm for Microarray Gene Expression Data. IEEE Congress on Evolutionary Computation.
[48] Choi J.K., Choi J.Y., Kim D.G., Choi D.W., Kim B.Y., Lee K.H., Yeom Y.I., Yoo H.S., Yoo O.J. and Kim S. (2004) Integrative analysis of multiple gene expression profiles applied to liver cancer study. FEBS Letters, Vol. 565, pp. 93-100.
[49] Huttenhower C., Hibbs M., Myers C. and Troyanskaya O.G. (2006) A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics, Vol. 1, pp. 2890-2896.
[50] Rodriguez-Zas S.L., Ko Y., Adams H.A.
and Southey B.R. (2008) Advancing the understanding of the embryo transcriptome co-regulation using meta-, functional, and gene network analysis tools. Reproduction, Vol. 135, pp. 213-236.
[51] Burguillo F.J., Martin J., Barrera I. and Bardsley W.G. (2010) Meta-analysis of microarray data: The case of imatinib resistance in chronic myelogenous leukemia. Comput Biol Chem, Vol. 34(3), pp. 184-192.
[52] Kadota K., Nakai Y. and Shimizu K. (2009) Ranking differentially expressed genes from Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity. Algorithms Mol Biol, Vol. 22.
[53] Xu L., Tan A.C., Winslow R.L. and Geman D. (2008) Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics, Vol. 27, pp. 9.
[54] Hong F. and Breitling R. (2008) A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics, Vol. 24, pp. 374-382.
[55] Yuan T., Gui-xia L., Ming Z., Yi Z. and Chun-guang Z. (2010) In Proc. of the 2010 2nd International Conference on Signal Processing Systems (ICSPS), Vol. 3, pp. 484-488.
[56] Xu L., Geman D. and Winslow R.L. (2007) Large-scale integration of cancer microarray data identifies a robust common cancer signature. BMC Bioinformatics, Vol. 8, pp. 275.
[57] Romdhane L.B., Ayeb B. and Wang S. (2002) On computing the fuzzifier in iFLVQ: A data driven approach. International Journal of Neural Systems, Vol. 12, pp. 149-157.
[58] Bezdek J.C., Pal N.R. and Hathaway R.J. (1996) Sequential competitive learning and the Fuzzy C-means clustering algorithms. Neural Networks, Vol. 9, pp. 787-796.
[59] Bezdek J.C. and Pal N.R. (1995) Two soft relatives of learning vector quantization. Neural Networks, Vol. 8, pp. 729-743.
[60] Dembele D. and Kastner P. (2003) Fuzzy C-means method for clustering microarray data. Bioinformatics, Vol. 19, pp. 973-980.
[61] Schwammle V. and Jensen O.N.
(2010) A simple and fast method to determine the parameters for fuzzy C-means cluster analysis. Bioinformatics, Vol. 26, pp. 2841-2848.
[62] Zhao Q., Li Q. and Xing S. (2010) FCM Algorithm Based on the Optimization Parameters of Objective Function Point. Proc. of the 2010 International Conference on Computing, Control and Industrial Engineering, Vol. 2.
[63] Chiu S.L. (1994) Fuzzy Model Identification Based on Cluster Estimation. Journal of Intelligent and Fuzzy Systems, Vol. 2, pp. 267-278.
[64] Li J., Chu C.H., Wang Y. and Yan W. (2002) An Improved Fuzzy C-means Algorithm for Manufacturing Cell Formation. IEEE.
[65] Liu W., Xiao C.J., Wang B.W., Shi Y. and Fang S.F. (2003) Study on Combining Subtractive Clustering with Fuzzy C-Means Clustering. In Proc. of the Second International Conference on Machine Learning and Cybernetics.
[66] Yang M.S. and Wu A.K.L. (2005) A modified mountain clustering algorithm. Pattern Anal Applic, Vol. 8, pp. 125-138.
[67] Yang Q., Zhang D. and Tian F. (2010) An initialization method for fuzzy c-means algorithm using subtractive clustering. In Proc. of the 2010 Third International Conference on Intelligent Networks and Intelligent Systems.
(2008) Discovering sequence motifs with arbitrary insertions and deletions. PLoS Comput. Biol. Vol. 4, el 000071 [74] Henikoff J.G. and Henikoff S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comp. App. Biosci. Vol. 12, pp. 135-143. [75] Hong T.P. and Wu M.T. (2008) A hierarchical gene-set genetic algorithm. J. Comp. Vol. 3, pp. 67- 75. [76] Li L., Liang Y. and Bass R.L. (2007) GAPWM: a genetic algorithm method for optimizing a position weight matrix. Bioinformatics, Vol. 23, pp.1 188-1 194. [77] Li L., Bass R.L. and Liang Y. (2008) fdrMotif: identifying cis-elements by an EM algorithm coupled with false discovery rate control. Bioinformatics, Vol. 24, pp. 629-636. [78] Liu D, Xiong X., Dasgupta B and Zhang H. (2006) Motif discoveries in unaligned molecular sequences using self-organizing neural networks. IEEE Trans. Neural Networks, Vol. 17, pp.9 19- 928. [79] Nowakowski S. and Tiuryn J. (2007) A new approach to the assessment of the quality of predictions of transcription factor binding sites. J. Biomed. Info. Vol. 40, pp. 139-149. [80] Osada R., Zaslavsky E. and Singh M. (2004) Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics, Vol. 20, pp.35 16-3525. 227 [81] Pisanti N., Crochemore M. Grossi R. and Sagot M.F. (2005) Bases of motifs for generating repeated patterns with wildcards. IEEE/ACM Trans. Comput. Biol, and Bioinfo. Vol. 2, pp.40-50. [82] Touzet H. and Varre J.S. (2007) Efficient and accurate P-value computation for position weight matrices. Algorithms for Mol. Biol. Vol. 2, pp. 15-26. [83] Wei Z. and Jensen S.T. (2006) GAME: Detecting cis-regulatory elements using a genetic algorithm. Bioninformatics Vol. 22, pp. 1577-1584. [84] Frank A. and Asuncion A. (2010) Machine Learning Repository, [Online], http://archive.ics.uci.edu/ml. 
[85] Cho R.J., Campbell M.J., Winzeler E.A., Steinmetz L., Conway A., Wodicka L., Wolfsberg T.G., Gabrielian A.E., Landsman D., Lockhart D.J. and Davis R.W. (1998) A genome-wide transcriptional analysis of the mitotic cell-cycle. Molecular Cell, Vol. 2, pp. 65-73.
[86] Mewes H.W., Frishman D., Güldener U., Mannhaupt G., Mayer K., Mokrejs M., Morgenstern B., Münsterkötter M., Rudd S. and Weil B. (1999) MIPS: a database for protein sequences and complete genomes. Nucleic Acids Research, Vol. 27, pp. 44-48.
[87] Wen X., Fuhrman S., Michaels G.S., Carr D.B., Smith S., Barker J.L. and Somogyi R. (1998) Large-scale temporal gene expression mapping of central nervous system development. Proceedings of the National Academy of Sciences USA, Vol. 95, pp. 334-339.
[88] Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T. and Tibshirani R. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, Vol. 17, pp. 520-525.
[89] Shi F., Abraham G., Leckie C., Haviv I. and Kowalczyk A. (2011) Meta-analysis of gene expression microarrays with missing replicates. BMC Bioinformatics, Vol. 12, pp. 84-99.
[90] Pedro J., Curado J. and Church G.M. (2009) Meta-analysis of age-related gene expression profiles identifies common signatures of aging. Bioinformatics, Vol. 25, pp. 875-881.
[91] Zhang Z., Chen D. and Fenstermacher D. (2007) Integrated analysis of independent gene expression microarray datasets improves the predictability of breast cancer outcome. BMC Genomics, Vol. 8, pp. 331-343.
[92] Bezdek J.C. (1981) Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York.
[93] Fukuyama Y. and Sugeno M. (1989) A new method of choosing the number of clusters for the fuzzy c-means method. In Proc. of the Fifth Fuzzy Systems Symp., pp. 247-250.
[94] Xie X.L. and Beni G. (1991) A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 13, pp. 841-847.
[95] Rezaee M.R., Lelieveldt B.P.F. and Reiber J.H.C. (1998) A new cluster validity index for the fuzzy c-mean, Pattern Recognition Letters, Vol. 19, pp. 237-246.
[96] Pakhira M.K., Bandyopadhyay S. and Maulik U. (2004) Validity index for crisp and fuzzy clusters, Pattern Recognition, Vol. 37, pp. 481-501.
[97] Rezaee B. (2010) A cluster validity index for fuzzy clustering, Fuzzy Sets and Systems, Vol. 161, pp. 3014-3025.
[98] Florea M.C., Jousselme A.L., Grenier D. and Bosse E. (2008) Approximation techniques for the transformation of fuzzy sets into random sets, Fuzzy Sets and Systems, Vol. 159, pp. 270-288.
[99] Xu L. and Jordan M.I. (1996) On convergence properties of the EM algorithm for Gaussian mixtures, Neural Computation, Vol. 8, pp. 129-151.
[100] Somogyi R., Wen X., Ma W. and Barker J.L. (1995) Developmental kinetics of GAD family mRNAs parallel neurogenesis in the rat, Journal of Neuroscience, Vol. 15, pp. 2575-2591.
[101] Wen X., Fuhrman S., Michaels G.S., Carr G.S., Smith D.B., Barker J.L. and Somogyi R. (1998) Large-scale temporal gene expression mapping of central nervous system development. In Proc. of the National Academy of Sciences USA, Vol. 95, pp. 334-339.
[102] Yeung K.Y., Haynor D.R. and Ruzzo W. (2001) Validating clustering for gene expression data. Bioinformatics, Vol. 17, pp. 309-318.
[103] Yeung K.Y., Fraley C., Murua A., Raftery A.E. and Ruzzo W.L. (2001) Model-based clustering and data transformations for gene expression data, Bioinformatics, Vol. 17, pp. 977-987.
[104] Mewes H.W., Hani J., Pfeiffer F. and Frishman D. (1998) MIPS: A database for protein sequences and complete genomes, Nucleic Acids Research, Vol. 26, pp. 33-37.
[105] Liu W.Y., Xiao C.J., Wang B.W., Shi Y. and Fang S.F. (2003) Study On Combining Subtractive Clustering With Fuzzy C-Means Clustering, in Proc. of Machine Learning and Cybernetics, International Conference, Xi'an, pp. 2659-2662.
[106] Yang M.S. and Wu K.L.
(2005) A modified mountain clustering algorithm, Pattern Anal. Applic., Vol. 8, pp. 125-138.
[107] Chen J.Y., Qin Z. and Jia J. (2008) A weighted mean subtractive clustering algorithm, Information Technology, Vol. 7, pp. 356-360.
[108] Tuan C.C., Lee J.H. and Chao S.J. (2009) Using Nearest Neighbor Method and Subtractive Clustering-Based Method on Antenna-Array Selection Used in Virtual MIMO in Wireless Sensor Network, in Proc. of the Tenth International Conference on Mobile Data Management: Systems, Services and Middleware, Taipei, pp. 496-501.
[109] Collazo J.C., Aceves F.M., Gorrostieta E.H., Pedraza J.O., Sotomayor A.O. and Delgado M.R. (2010) Comparison between Fuzzy C-means clustering and Fuzzy Clustering Subtractive in urban air pollution, in Proc. of Electronics, Communications and Computer (CONIELECOMP), 2010 20th International Conference, Cholula, pp. 174-179.
[110] Yang Q., Zhang D. and Tian F. (2010) An initialization method for Fuzzy C-means algorithm using Subtractive Clustering, Third International Conference on Intelligent Networks and Intelligent Systems, Vol. 10, pp. 393-396.
[111] Li J., Chu C.H., Wang Y. and Yan W. (2002) An Improved Fuzzy C-means Algorithm for Manufacturing Cell Formation, Fuzzy Systems, Vol. 2, pp. 1505-1510.
[112] Loquin K. and Strauss O. (2008) Histogram density estimators based upon a fuzzy partition, Statistics and Probability Letters, Vol. 78, pp. 1863-1868.
[113] van Houte B.P.P. and Heringa J. (2010) Accurate confidence aware clustering of array CGH tumor profiles, Bioinformatics, Vol. 26, pp. 6-14.
[114] Klinge C.M. (2001) Estrogen receptor interaction with estrogen response elements. Nucleic Acids Research, Vol. 29, pp. 2905-2919.
[115] Salama R.A. and Stekel D.J. (2010) Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction. Nucleic Acids Research, Vol. 38, p. e135.
[116] Wang H., He X., Band M., Wilson C. and Liu L.
(2005) A study of inter-lab and inter-platform agreement of DNA microarray data. BMC Genomics, Vol. 6, p. 71.
[117] Fisher R.A. (1925) Statistical methods for research workers, 13th ed. Oliver & Boyd, London.
[118] Tavazoie S., Hughes J.D., Campbell M.J., Cho R.J. and Church G.M. (1999) Systematic determination of genetic network architecture. Nat. Genet., Vol. 22, pp. 281-285.
[119] Dayangac-Erden D., Bora G., Ayhan P., Kocaefe Ç., Dalkara S., Yelekci K., Demir A.S. and Erdem-Yurter H. (2009) Histone Deacetylase Inhibition Activity and Molecular Docking of (E)-Resveratrol: Its Therapeutic Potential in Spinal Muscular Atrophy. Chemical Biology & Drug Design, Vol. 73, pp. 355-364.
[120] Faumont N., Durand-Panteix S., Schlee M., Gromminger S., Schuhmacher M., Holzel M., Laux G., Mailhammer R., Rosenwald A., Staudt L.M., Bornkamm G.W. and Feuillard J. (2009) c-Myc and Rel/NF-kappaB are the two master transcriptional systems activated in the latency III program of Epstein-Barr virus-immortalized B cells. Journal of Virology, Vol. 83, pp. 5014-5027.
[121] Dunn J.C. (1974) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybernet., Vol. 3, pp. 32-57.
[122] Höppner F., Klawonn F., Kruse R. and Runkler T. (2000) Fuzzy cluster analysis - methods for classification, data analysis and image recognition, John Wiley and Sons Ltd.
[123] Dunn J. (1974) Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, Vol. 4, pp. 95-104.
[124] Wang W. and Zhang Y. (2007) On fuzzy cluster validity indices, Journal of Fuzzy Sets and Systems, Vol. 158, pp. 2095-2117.
[125] Gath I. and Geva A.B. (1989) Unsupervised Optimal Fuzzy Clustering, Pattern Analysis and Machine Intelligence, Vol. 11, pp. 773-781.
[126] Krishnapuram R. and Freg C.P. (1992) Fitting an unknown number of lines and planes to image data through compatible cluster merging. Pattern Recognition, Vol. 25, pp. 385-400.
[127] Kim Y., Kim D., Lee D. and Lee K.H.
(2004) A cluster validation index for GK cluster analysis based on relative degree of sharing. Information Sciences, Vol. 168, pp. 225-242.
[128] Kaymak U. and Babuška R. (1995) Compatible cluster merging for fuzzy modeling, In Proc. IEEE Int. Conf. Fuzzy Systems, Yokohama, Japan, Mar. 1995, pp. 897-904.
[129] Tasdemir K. and Merenyi E. (2011) Exploiting Data Topology in Visualization and Clustering of Self-Organizing Maps, Neural Networks, Vol. 20, pp. 549-562.
[130] Kaymak U. and Setnes M. (2002) Fuzzy Clustering With Volume Prototypes and Adaptive Cluster Merging, Fuzzy Systems, Vol. 10, pp. 705-711.
[131] Iam-on N., Boongoen T. and Garrett S. (2010) LCE: a link-based cluster ensemble method for improved gene expression data analysis, Bioinformatics, Vol. 26, pp. 1513-1519.
[132] Valente de Oliveira J. and Pedrycz W. (2007) Advances in Fuzzy Clustering and its Applications, John Wiley & Sons Ltd.
[133] Bezdek J.C. (1980) A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms, Pattern Analysis and Machine Intelligence, Vol. 2, pp. 1-8.
[134] Bezdek J.C., Hathaway R.J., Sabin M.J. and Tucker W.T. (1987) Convergence Theory for Fuzzy c-Means: Counterexamples and Repairs, Systems, Man and Cybernetics, Vol. 17, pp. 873-877.
[135] Cohen J., Cohen P., West S.G. and Aiken L.S. (2002) Applied multiple regression/correlation analysis for the behavioral sciences, 3rd ed. Psychology Press.
[136] Hedges L.V. and Olkin I. (1985) Statistical Methods for Meta-Analysis, Orlando: Academic Press.
[137] Li D., Zhong C. and Zhang L. (2010) Fuzzy c-means clustering of partially missing data sets based on statistical representation, In Proc. of Int'l Conf. Fuzzy Systems and Knowledge Discovery, pp. 460-464, Dalian, China, Aug 2010.
[138] Te-Shun C., Yen K.K., Liwei A., Pissinou N. and Makki K. (2007) Fuzzy belief pattern classification of incomplete data, In Proc. Int'l Conf. Systems, Man and Cybernetics, pp. 535-540, Oct 2007, Montreal, CA.
[139] Garcia-Laencina P.
J., Sancho-Gomez J.L. and Figueiras-Vidal A.R. (2010) Pattern classification with missing data: a review, Neural Computing & Applications, Vol. 19, pp. 263-282.
[140] Hathaway R.J. and Bezdek J.C. (2001) Fuzzy c-Means Clustering of Incomplete Data, Systems, Man and Cybernetics, Vol. 31, pp. 735-744.
[141] Luo J.W., Yang T. and Wang Y. (2005) Missing Value Estimation For Microarray Data Based On Fuzzy C-means Clustering, In Proc. Int'l Conf. on High-Performance Computing, Changsha, China, Dec 2003.
[142] Mohammadi A. and Saraee M.H. (2008) Estimating Missing Value in Microarray Data Using Fuzzy Clustering and Gene Ontology, In Proc. Int'l Conf. on Bioinformatics and Biomedicine, pp. 382-385, Washington, DC, USA.
[143] Himmelspach L. and Conrad S. (2010) Clustering approaches for data with missing values: Comparison and evaluation, In Proc. of Int'l Conf. on Digital Information Management, pp. 19-28, Thunder Bay, ON.
[144] Wu K.L. (2010) Parameter Selections of Fuzzy C-Means Based on Robust Analysis, Engineering and Technology, Vol. 65, pp. 554-557.
[145] Le T., Altman T. and Gardiner J.K. (2010) HIGEDA: a hierarchical gene-set genetics based algorithm for finding subtle motifs in biological sequences, Bioinformatics, Vol. 26, pp. 302-309.
[146] Le T. and Altman T. (2011) A new initialization method for the Fuzzy C-Means Algorithm using Fuzzy Subtractive Clustering, In Proc. of the 2011 International Conference on Information and Knowledge Engineering, pp. 144-150, Las Vegas, Nevada, USA.
[147] Le T. and Gardiner J.K. (2011) A validation method for fuzzy clustering of gene expression data, In Proc. of the 2011 International Conference on Bioinformatics & Computational Biology, Vol. I, pp. 23-29, Las Vegas, Nevada, USA.
[148] Bodjanova S. (2004) Linear intensification of probabilistic fuzzy partitions, Fuzzy Sets and Systems, Vol. 141, pp. 319-332.
[149] Chiu W.Y. and Couloigner I.
(2006) Modified fuzzy c-means classification technique for mapping vague wetlands using Landsat ETM+ imagery, Hydrol. Process., Vol. 20, pp. 3623-3634.
[150] Chuang K.S., Tzeng H.L., Chen S., Wu J. and Chen T.J. (2006) Fuzzy c-means clustering with spatial information for image segmentation, Computerized Medical Imaging and Graphics, Vol. 30, pp. 9-15.
[151] Genther H., Runkler T.A. and Glesner M. (1994) Defuzzification based on fuzzy clustering, Fuzzy Systems, Vol. 3, pp. 1646-1648.
[152] Roychowdhury S. and Pedrycz W. (2001) A Survey of Defuzzification Strategies, Intelligent Systems, Vol. 16, pp. 679-695.
[153] Yang Y., Huang S. and Rao N. (2008) An Automatic Hybrid Method for Retinal Blood Vessel Extraction, Applied Mathematics and Computer Science, Vol. 18, pp. 399-407.
[154] Le T., Altman T. and Gardiner J.K. (2012) Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data, Proc. Int'l Conf. on Information Technology-New Generations, pp. 42-47, Las Vegas, NV, USA.
[155] Le T., Altman T. and Gardiner J.K. (2012) Density-based imputation method for fuzzy cluster analysis of gene expression microarray data, Proc. Int'l Conf. on Bioinformatics and Computational Biology, pp. 190-195, Las Vegas, NV, USA.
[156] Le T., Altman T. and Gardiner J.K. (2012) A Probability Based Defuzzification Method for Fuzzy Cluster Partition, Proc. Int'l Conf. on Artificial Intelligence (WORLDCOMP-ICAI'12), Vol. 2, pp. 1038-1043, Las Vegas, NV, USA.
[157] Le T., Altman T. and Gardiner J.K. (2012) A Fuzzy Clustering Method using Genetic Algorithm and Fuzzy Subtractive Clustering, Proc. Int'l Conf. on Information and Knowledge Engineering (WORLDCOMP-IKE'12), pp. 426-432, Las Vegas, NV, USA.
[158] Luo J.W., Yang T. and Wang Y. (2006) Missing Value Estimation For Microarray Data Based On Fuzzy C-means Clustering, Proc. Int'l Conf. High-Performance Computing (HPCASIA'05), IEEE Press, Feb. 2006, pp. 11-16.
[159] Kim D.W., Lee K.Y., Lee K.H. and Lee D.
(2007) Towards clustering of incomplete microarray data without the use of imputation, Bioinformatics, Vol. 23, Jan. 2007, pp. 107-113.
[160] Mohammadi A. and Saraee M.H. (2008) Estimating Missing Value in Microarray Data Using Fuzzy Clustering and Gene Ontology, Proc. Int'l Conf. Bioinformatics and Biomedicine (BIBM'08), IEEE Press, Nov. 2008, pp. 382-385.
[161] Mehdizadeh E., Sadi-Nezhad S. and Tavakkoli-Moghaddam R. (2008) Optimization of Fuzzy Clustering Criteria by a Hybrid PSO and Fuzzy C-Means Clustering Algorithm, Iranian Journal of Fuzzy Systems, Vol. 5, pp. 1-14.
[162] Haider A., Pramanik S. and Kar A. (2011) Dynamic Image Segmentation using Fuzzy C-Means based Genetic Algorithm, International Journal of Computer Applications, Vol. 28, pp. 15-20.
[163] Ghosh A., Mishra N.S. and Ghosh S. (2011) Fuzzy clustering algorithms for unsupervised change detection in remote sensing images, Information Sciences, Vol. 181, pp. 699-715.
[164] Lianjiang Z., Shouning Q. and Tao D. (2010) Adaptive fuzzy clustering based on genetic algorithm, In Proc. of 2nd Conference on Advanced Computer Control, Shenyang, China, pp. 79-82.
[165] Lin T.C., Huang H.C., Liao B.Y. and Pan J.S. (2007) An Optimized Approach on Applying Genetic Algorithm to Adaptive Cluster Validity Index, International Journal of Computer Sciences and Engineering Systems, Vol. 1, pp. 253-257.
[166] Liu Y. and Zhang Y. (2007) Optimizing Parameters of Fuzzy c-Means Clustering Algorithm, In Proc. of the 4th Conference on Fuzzy Systems and Knowledge Discovery (FSKD '07), Vol. 1, pp. 633-638.
[167] Sturgeon X., Le T. and Gardiner J.K. (2010) Pathways to Intellectual Disability in Down Syndrome, Tenth Annual Coleman Institute Conference (The Coleman Institute for Cognitive Disabilities, University of Colorado), Westminster, Colorado, USA, 10/2010.
[168] Sturgeon X., Le T., Ahmed M. and Gardiner J.K. (2012) Pathways to cognitive deficits in Down syndrome, Progress in Brain Research - Elsevier, Vol.
197, pp. 73-100.
[169] Le T. and Altman T. (in preparation) RPPR: a rank product statistics with PageRank algorithm combination method for expression profiles selection using gene expression signature.
[170] Le T., Leach S. and Gardiner J.K. (in preparation) A new approach of Connectivity MAP to discover gene-drug association using gene expression microarray data.
[171] Gardiner J.K. (2010) Molecular basis of pharmacotherapies for cognition in Down syndrome, Trends Pharmacol. Sci., Vol. 31, pp. 66-73.
[172] Dubois D. and Prade H. (1988) Possibility theory: an approach to computerized processing of uncertainty, Plenum Press.
[173] Dubois D. and Prade H. (1980) Fuzzy sets and systems: Theory and applications, Mathematics in Science and Engineering, Academic Press.
[174] Wolkenhauer O. (1998) Possibility theory with applications to data analysis, John Wiley & Sons Inc.
[175] Chu C.Y. and Rana T.M. (2006) Translation Repression in Human Cells by MicroRNA-Induced Gene Silencing Requires RCK/p54, PLoS Biology, Vol. 4, pp. 1122-1136.
[176] Velusamy K. and Manavalan R. (2011) Performance Analysis of Unsupervised Classification Based on Optimization, Int'l Journal of Computer Applications, Vol. 42, pp. 22-27.
[177] Rana S., Jasola S. and Kumar R. (2011) A review on particle swarm optimization algorithms and their applications to data clustering, Artif. Intell. Rev., Vol. 35, pp. 211-222.
[178] Xie J., Li K.C. and Bina M. (2004) A Bayesian Insertion/Deletion Algorithm for Distant Protein Motif Searching via Entropy Filtering, Journal of the American Statistical Association, Vol. 99, pp. 409-420.
[179] Klir G.J. and Yuan B. (1995) Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall.
[180] Le T., Altman T., Vu L. and Gardiner J.K. (in preparation) A cluster validation method using GO for gene expression data analysis.
[181] Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., et al.
(2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat. Genet., Vol. 25(1), pp. 25-29.
[182] Tari L., Baral C. and Kim S. (2009) Fuzzy c-means clustering with prior biological knowledge, J. Biomed. Inform., Vol. 42, pp. 74-81.
[183] Xu T., Du L.F. and Zhou Y. (2008) Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data, BMC Bioinformatics, Vol. 9, pp. 472-481.
[184] Wang J.Z., Du Z., Payattakool R., Yu P.S. and Chen C.F. (2007) A new method to measure the semantic similarity of GO terms, Bioinformatics, Vol. 23, pp. 1274-1281.
[185] Yu G., Li F., Qin Y. et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products, Bioinformatics, Vol. 26, pp. 976-983.
[186] Krucker T., Siggins G.R. and Halpain S. (2000) Dynamic actin filaments are required for stable long-term potentiation (LTP) in area CA1 of the hippocampus, PNAS, Vol. 97, pp. 6856-6861.